Text File | 1997-02-28 | 139.6 KB | 3,096 lines |
- HOW TO OPTIMIZE FOR THE PENTIUM PROCESSOR
- *****************************************
- Copyright (c) 1996, 1997 by Agner Fog. Last modified 1997-02-28.
-
- Contents:
-
- 1. introduction
- 2. literature
- 3. debugging and verifying
- 4. memory model
- 5. alignment
- 6. cache
- 7. address generation interlock
- 8. pairing integer instructions
- 9. first time vs. repeated execution
- 10. imperfect pairing
- 11. splitting complex instructions into simpler ones
- 12. jumps and branches
- 13. prefixes
- 14. reducing code size
- 15. scheduling floating point code
- 16. loops
- 17. discussion of special instructions
- 18. using integer instructions to do floating point operations
- 19. using floating point instructions to do integer operations
- 20. list of integer instructions
- 21. list of floating point instructions
- 22. testing code speed
- 23. considerations for other microprocessors
-
-
- 1. INTRODUCTION
- ===============
- This manual describes in detail how to write optimized assembly language
- code, with particular focus on the Pentium(R) microprocessor.
-
- This manual is more comprehensive and exact than other sources of
- information and contains many details not found anywhere else.
- This information will enable you in many cases to calculate exactly how
- many clock cycles a piece of code will take.
-
- The information in this manual is based on my own research. This research
- has so far covered only the Pentium Processor without MMX. The Pentium
- with MMX behaves almost the same. Differences are mentioned where
- appropriate. This manual will be updated later with more information on
- the Pentium with MMX. The PentiumPro and PentiumII processors are very
- different, and will be covered only briefly. I have no plans to do
- research on the PentiumPro and PentiumII processors.
-
- Programming in assembly language is much more difficult than programming
- in a high level language. It is very easy to make bugs, and very hard to
- find them. Now
- you have been warned! It is assumed that the reader is already experienced
- in assembly programming. If not, then please read some books on the subject
- and get some programming experience before you begin to do complicated
- optimations.
-
- The hardware design of the Pentium chip has many features which are
- optimized specifically for some commonly used instructions or instruction
- combinations, rather than using general optimation methods. Consequently,
- the rules for optimizing software for this design are quite complicated
- and have many exceptions, but the possible gain in performance may be
- substantial.
-
- Before you start to convert your code to assembly, make sure that your
- algorithm is optimal. Often you can improve a piece of code much more by
- improving the algorithm than by converting it to assembler code.
-
- Some high level language compilers offer relatively good optimation for
- the Pentium, and further optimation by hand may not be worth the effort
- except for the most critical pieces of code - or for the sport of it.
-
- Please don't send your programming questions to me. I am not going to do your
- homework for you.
-
- Good luck with your hunt for nanoseconds!
-
- PS. I have deliberately misspelled 'optimization' because I think
- 'optimation' sounds better.
-
-
- 2. LITERATURE
- =============
- A lot of useful literature can be downloaded for free from Intel's www
- site or acquired in print or on CD-ROM.
-
- I will not give the URLs here because they change very often.
- You can find the documents you need by using the search facilities at:
- http://www.intel.com/sites/developer/search.htm or follow the links from
- http://announce.com/agner/assem
-
- Some documents are in .PDF format. If you don't have software for viewing
- or printing .PDF files, then you may download the Acrobat file reader from
- www.adobe.com
-
- The most useful document is Intel's application note:
- 'AP-526 Optimizations for Intel's 32-bit Processors'
-
- Beginners may use a funny tutorial named "Optimizing Applications for the
- Pentium Processor"
-
- Optimations for the PentiumPro and PentiumII processors are described in
- 'AP-526 Optimizations for Intel's 32-bit Processors' and in the tutorial
- 'Optimizing for the Pentium(R) Pro Processor Family'.
-
- The use of MMX instructions for optimizing specific applications is
- described in several application notes. The MMX instruction set is
- described in various manuals and tutorials.
-
- Many sources other than Intel also have useful information. These
- sources are listed in the FAQ for the newsgroup comp.lang.asm.x86.
- The shareware editor ASMEDIT has an online help which covers all
- instruction codes etc. It is available from:
- http://www.inf.tu-dresden.de/~ok3/asmedit.html
-
- For other internet resources follow the links from
- http://announce.com/agner/assem/
-
-
-
- 3. DEBUGGING AND VERIFYING
- ==========================
- Debugging assembly code can be quite hard and frustrating, as you probably
- already have discovered. I would recommend that you start by writing the
- piece of code you want to optimize as a subroutine in a high level language.
- Next, write a test program that will test your subroutine thoroughly. Make
- sure the test program goes into all branches and special cases.
-
- When your high level language subroutine works with your test program, then
- you are ready to translate the code to assembly language (some high level
- compilers will do this for you), and test it on the test program.
-
- Then you can start to optimize. Each time you have made a modification you
- should run it on the test program to see if it works correctly.
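- The workflow above can be sketched in C. This is a minimal, hypothetical
- harness: sum_ref is the straightforward reference routine, and sum_opt
- stands in for the hand-optimized (e.g. assembly) version; all names are
- illustrative.

```c
#include <stddef.h>

/* Reference implementation: simple and obviously correct. */
int sum_ref(const int *p, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Stand-in for the optimized routine (here unrolled by two in C,
   where the real thing would be assembly). */
int sum_opt(const int *p, int n)
{
    int s0 = 0, s1 = 0, i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += p[i];
        s1 += p[i + 1];
    }
    if (i < n)                    /* odd-length tail */
        s0 += p[i];
    return s0 + s1;
}

/* Run both versions over several inputs, including the empty case,
   and report whether they agree. */
int versions_agree(void)
{
    int data[] = { 3, -7, 0, 42, 1, 1, -1 };
    for (int n = 0; n <= 7; n++)
        if (sum_ref(data, n) != sum_opt(data, n))
            return 0;
    return 1;
}
```

- Re-running such a check after every modification catches regressions
- before they get buried under further changes.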
-
- Number all your versions and save them so that you can go back and test
- them again in case you discover an error which the test program didn't
- catch (such as writing to a wrong address).
-
- Test the speed of the most critical part of your program with the method
- described in chapter 22. If the code is significantly slower than expected,
- then the most probable reasons are: cache misses (chapter 6), misaligned
- operands (chapter 5), and first time penalty (chapter 9).
-
-
- 4. MEMORY MODEL
- ===============
- The Pentium is designed primarily for 32 bit code, and its performance is
- inferior on 16 bit code. Segmenting your code and data also degrades
- performance significantly, so you should generally prefer 32 bit flat mode,
- and an operating system which supports this mode (e.g. Windows 95, OS/2, or
- a 32 bit DOS extender). The code examples shown in this manual assume a 32
- bit flat memory model, unless otherwise specified.
-
-
- 5. ALIGNMENT
- ============
- All data in RAM should be aligned to addresses divisible by 2, 4, or 8
- according to this scheme:
-
- operand size alignment
- ---------------------------
- 1 (byte) 1
- 2 (word)        2 (or address MOD 4 <> 3; other processors require alignment by 2)
- 4 (dword) 4
- 6 (fword) 4 (Pentium Pro requires alignment by 8)
- 8 (qword) 8
- 10 (tbyte) 8 (Pentium Pro requires alignment by 16)
-
- Misaligned data will take at least 3 clock cycles extra to access.
-
- Aligning code is not necessary when you run on the Pentium, but for optimal
- performance on other processors you may align subroutine entries and loop
- entries by 16.
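- As a sketch of how such alignment can be handled from a high level
- language, here is a hypothetical C helper for rounding an address up to a
- power-of-two boundary (compilers normally align scalar variables for you;
- this is for manually placed buffers):

```c
#include <stdint.h>

/* Round addr up to the next multiple of align.
   align must be a power of two (2, 4, 8, 16, 32, ...). */
uintptr_t align_up(uintptr_t addr, uintptr_t align)
{
    return (addr + align - 1) & ~(align - 1);
}

/* Nonzero if addr is already aligned to the given boundary. */
int is_aligned(uintptr_t addr, uintptr_t align)
{
    return (addr & (align - 1)) == 0;
}
```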
-
-
- 6. CACHE
- ========
- The Pentium without MMX has 8 kb of on-chip cache (level one cache) for
- code, and 8 kb for data. The Pentium with MMX has 16 kb for code and 16 kb
- for data. Data in the level 1 cache can be read or written to in just one
- clock cycle, whereas a cache miss may cost many clock cycles. It is
- therefore important that you understand how the cache works in order to use
- it most efficiently.
-
- The data cache in the Pentium without MMX consists of 256 lines of 32 bytes
- each. Each time you read a data item which is not cached, the processor will
- read an entire cache line from memory. The cache lines are always aligned to
- a physical address divisible by 32. When you have read a byte at an address
- divisible by 32, then the next 31 bytes can be read or written to at almost
- no extra cost. You can take advantage of this by arranging data items which
- are used near each other together into aligned blocks of 32 bytes of memory.
- If, for example, you have a loop which accesses two arrays, then you may
- combine the two arrays into one array of structures, so that data which are
- used together are also stored together.
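- The array-combining idea can be sketched in C; the struct and function
- names below are purely illustrative:

```c
#include <stddef.h>

/* With two parallel arrays, a[i] and b[i] may sit in different cache
   lines. Combined into one array of structs, the pair of elements used
   together occupies adjacent bytes and normally shares a cache line. */
struct pair { float a, b; };

float dot_combined(const struct pair *p, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += p[i].a * p[i].b;   /* .a and .b come from the same line */
    return s;
}
```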
-
- If the size of an array or other data structure is a multiple of 32 bytes,
- then you should preferably align it by 32.
-
- A cache line can not be assigned to an arbitrary memory address. Each cache
- line has a 7-bit set-value which must match bit 5 through 11 of the physical
- RAM address (bit 0-4 define the 32 bytes within a cache line). There are two
- cache lines for each of the 128 set-values, so there are two possible cache
- lines to assign to any RAM address (four for the MMX version). The
- consequence of this is that the cache can hold no more than two different
- data blocks which have the same value in bits 5-11 of the address. You can
- determine if two addresses have the same set-value by the following method:
- Strip off the lower 5 bits of each address to get a value divisible by 32.
- If the difference between the two truncated addresses is a multiple of 4096
- (=1000H), then the addresses have the same set-value.
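- As a sketch, the same check can be written in C (the helper name is
- illustrative):

```c
#include <stdint.h>

/* Nonzero if two data addresses compete for the same set in the
   Pentium (non-MMX) data cache: strip the lower 5 bits (the 32-byte
   line offset); the addresses have the same set-value when the
   truncated difference is a multiple of 4096 (128 sets * 32 bytes). */
int same_set_value(uintptr_t a, uintptr_t b)
{
    uintptr_t ta = a & ~(uintptr_t)31;   /* align down to cache line */
    uintptr_t tb = b & ~(uintptr_t)31;
    uintptr_t diff = ta > tb ? ta - tb : tb - ta;
    return diff % 4096 == 0;
}
```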
-
- Let me illustrate this by the following piece of code, where ESI holds an
- address divisible by 32:
-
- AGAIN: MOV EAX, [ESI]
- MOV EBX, [ESI + 13*4096 + 4]
- MOV ECX, [ESI + 20*4096 + 28]
- DEC EDX
- JNZ AGAIN
-
- The three addresses used here all have the same set-value because the
- differences between the truncated addresses are multiples of 4096. This loop
- will perform very poorly. At the time you read ECX there is no free cache
- line with the proper set-value so the processor takes the least recently
- used of the two possible cache lines, that is the one which was used for EAX,
- and fills it with the data from [ESI + 20*4096] to [ESI + 20*4096 + 31] and
- reads ECX. Next, when reading EAX, you find that the cache line that held
- the value for EAX has now been discarded, so you take the least recently
- used line, which is the one holding the EBX value, and so on. You have
- nothing but cache misses and the loop takes something like 60 clock cycles.
- If the third line is changed to:
-
- MOV ECX, [ESI + 20*4096 + 32]
-
- then we have crossed a 32 byte boundary, so that we do not have the same
- set-value as in the first two lines, and there will be no problem assigning
- a cache line to each of the three addresses. The loop now takes only 3 clock
- cycles (except for the first time) - a very considerable improvement!
-
- It may be very difficult to determine if your data addresses have the same
- set-values, especially if they are scattered around in different segments.
- The best thing you can do to avoid problems of this kind is to keep all data
- used in the critical part of your program within one contiguous block of
- max. 8 kbytes or two contiguous blocks of max. 4 kbytes each (for example
- one block for static data and another block for data on the stack). This
- will make sure that your cache lines are used optimally.
-
- If the critical part of your code accesses big data structures or random
- data addresses, then you may want to keep all frequently used variables
- (counters, pointers, control variables, etc.) within a single contiguous
- block of max 4 kbytes so that you have a complete set of cache lines free
- for accessing random data. Since you probably need stack space anyway for
- subroutine parameters and return addresses, the best thing is to copy all
- frequently used static data to dynamic variables on the stack, and copy
- them back again outside the critical loop if they have been changed.
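- A sketch of this copy-to-stack technique in C (all names are
- illustrative; where the copies actually land depends on the compiler):

```c
/* Static data, possibly far from the stack in memory. */
static int hit_count = 0;

int count_positive(const int *data, int n)
{
    int local_count = hit_count;   /* copy to a stack variable/register */
    for (int i = 0; i < n; i++)
        if (data[i] > 0)
            local_count++;         /* hot loop touches only local data */
    hit_count = local_count;       /* write back once, outside the loop */
    return local_count;
}
```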
-
- Reading a data item which is not in the level one cache causes an entire
- cache line to be filled from the level two cache, which takes approximately
- 200 ns (that is 20 clocks on a 100 MHz system or 40 clocks on a 200 MHz
- system), but the bytes you ask for first are available already after 50-100
- ns. If the data item is not in the level two cache either, then you will
- get a delay of something like 200-300 ns. This delay will be somewhat longer
- if you cross a DRAM page boundary. (The size of a DRAM page is 1 kb for 4 MB
- 72 pin RAM modules, and 2 kb for 16 MB modules).
-
- When you write to an address which is not in the level 1 cache, then the
- value will go right through to the level 2 cache or to the RAM (depending on
- how the level 2 cache is set up). This takes approximately 100 ns. If you
- write eight or more times to the same 32 byte block of memory without also
- reading from it, and the block is not in the level one cache, then it may
- be advantageous to make a dummy read from the block first to load it into a
- cache line. All subsequent writes to the same block will then go to the
- cache instead, which takes only one clock cycle. (This is not needed on a
- PentiumPro, which always loads a cache line on write misses). There is
- sometimes a small penalty for writing repeatedly to the same address
- without reading in between.
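- The dummy-read trick might look like this in C; this is a sketch with
- illustrative names, not a measured implementation:

```c
/* Touch each 32-byte block with a dummy read before a burst of writes,
   so the writes hit the level-1 cache instead of going straight through
   (useful on the Pentium; a PentiumPro loads the line on a write miss
   anyway, making the preload pass unnecessary there). */
void fill_with_preload(unsigned char *dst, int n, unsigned char value)
{
    volatile unsigned char sink = 0;
    for (int i = 0; i < n; i += 32)
        sink = dst[i];             /* dummy read loads the cache line */
    (void)sink;                    /* keep reads from being optimized away */
    for (int i = 0; i < n; i++)
        dst[i] = value;            /* writes now go to the cache */
}

/* Small self-check: fill a 64-byte buffer with 7 and sum it. */
int demo(void)
{
    unsigned char buf[64];
    int sum = 0;
    for (int i = 0; i < 64; i++)
        buf[i] = 0;                /* defined contents for the dummy reads */
    fill_with_preload(buf, 64, 7);
    for (int i = 0; i < 64; i++)
        sum += buf[i];
    return sum;                    /* 64 * 7 = 448 */
}
```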
-
- Another way to reduce the number of write misses is to write 8 bytes at a
- time using FILD / FISTP with qword operands rather than writing 4 bytes at
- a time using integer registers. The extra cost of the slow FILD and FISTP
- instructions is compensated for by the fact that you only have to do half as
- many writes. However, this method is only advantageous on a Pentium, and
- only if the destination is not in the level 1 cache. The method is explained
- in section 19 below.
-
- Temporary data may conveniently be stored on the stack because stack data
- are very likely to be in the cache. You should be aware, however, that if
- you store QWORD data on a DWORD size stack, or DWORD data on a WORD size
- stack, then you have a potential alignment problem.
-
- If the life ranges of two data structures do not overlap, then they may use
- the same RAM area to increase cache efficiency. This is consistent with the
- common practice of allocating space for temporary variables on the stack.
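- In C, such sharing can be made explicit with a union, assuming the two
- temporaries are never live at the same time (names illustrative):

```c
#include <stddef.h>

/* Two scratch areas with non-overlapping life ranges share one block
   of RAM, so they also share the same cache lines. */
union scratch {
    double phase1_buf[16];    /* used only during the computation phase */
    char   phase2_text[128];  /* used only while formatting the output  */
};
```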
-
- Storing temporary data in registers is of course even more efficient. Since
- registers are a scarce resource, you may want to use [ESP] rather than [EBP]
- for addressing data on the stack, in order to free EBP for other purposes.
- Just don't forget that the value of ESP changes every time you do a PUSH or
- POP. (You cannot use ESP under 16-bit Windows because the timer interrupt
- will modify the high word of ESP at unpredictable places in your code.)
-
- The Pentium has a separate 8 kb cache for code, which is similar to the data
- cache. It is important that the critical part of your code (the innermost
- loops) fit in the code cache. Frequently used pieces of code or routines
- which are used together should preferably be stored near each other. Seldom
- used branches or procedures should go to the bottom of your code.
-
-
- 7. ADDRESS GENERATION INTERLOCK (AGI)
- =====================================
- It takes one clock cycle to calculate the address needed by an instruction
- which accesses memory. Normally, this calculation is done at a separate
- stage in the pipeline while the preceding instruction or instruction pair is
- executing. But if the address depends on the result of an instruction
- executing in the preceding clock cycle, then you have to wait an extra clock
- cycle for the address to be calculated. This is called an AGI stall.
-
- Example:
- ADD EBX,4 / MOV EAX,[EBX] ; AGI stall
- The stall in this example can be removed by putting some other instructions
- in between ADD EBX,4 and MOV EAX,[EBX] or by rewriting the code to:
- MOV EAX,[EBX+4] / ADD EBX,4
-
- You can also get an AGI stall with instructions which use ESP implicitly for
- addressing, such as PUSH, POP, CALL, and RET, if ESP has been changed in the
- preceding clock cycle by instructions such as MOV, ADD, or SUB. The Pentium
- has special circuitry to predict the value of ESP after a stack operation so
- that you do not get an AGI delay after changing ESP with PUSH, POP, or CALL.
- You can get an AGI stall after RET only if it has an immediate operand to
- add to ESP.
-
- Examples:
- ADD ESP,4 / POP ESI ; AGI stall
- POP EAX / POP ESI ; no stall, pair
- MOV ESP,EBP / RET ; AGI stall
- CALL L1 / L1: MOV EAX,[ESP+8] ; no stall
- RET / POP EAX ; no stall
- RET 8 / POP EAX ; AGI stall
-
- The LEA instruction is also subject to an AGI stall if it uses a base or
- index register which has been changed in the preceding clock cycle. Example:
- INC ESI / LEA EAX,[EBX+4*ESI] ; AGI stall
-
-
- 8. PAIRING INTEGER INSTRUCTIONS
- ===============================
- The Pentium has two pipelines for executing instructions, called the U-pipe
- and the V-pipe. Under certain conditions it is possible to execute two
- instructions simultaneously, one in the U-pipe and one in the V-pipe. This
- can almost double the speed. It is therefore advantageous to reorder your
- instructions to make them pair.
-
- The following instructions are pairable in either pipe:
- MOV register, memory, or immediate into register or memory
- PUSH register or immediate
- POP register
- LEA, NOP
- INC, DEC, ADD, SUB, CMP, AND, OR, XOR,
- and some forms of TEST (see section 17.3)
-
- The following instructions are pairable in the U-pipe only:
- ADC, SBB
- SHR, SAR, SHL, SAL with immediate count
- ROR, ROL, RCR, RCL with an immediate count of 1
-
- The following instructions can execute in either pipe but are only pairable
- when in the V-pipe: near call, short and near jump, short and near
- conditional jump.
-
- All other integer instructions can execute in the U-pipe only, and are not
- pairable.
-
- Two consecutive instructions will pair when the following conditions are met:
-
- 8.1 The first instruction is pairable in the U-pipe and the second
- instruction is pairable in the V-pipe.
-
- 8.2 The second instruction cannot read or write a register which the first
- instruction writes to. Examples:
- MOV EAX, EBX / MOV ECX, EAX ; read after write, do not pair
- MOV EAX, 1 / MOV EAX, 2 ; write after write, do not pair
- MOV EBX, EAX / MOV EAX, 2 ; write after read, pair OK
- MOV EBX, EAX / MOV ECX, EAX ; read after read, pair OK
- MOV EBX, EAX / INC EAX ; read and write after read, pair OK
-
- 8.3 In rule 8.2 partial registers are treated as full registers. Example:
- MOV AL, BL / MOV AH, 0 ; writes to different parts of the same
- ; register, do not pair
-
- 8.4 Two instructions which both write to parts of the flags register can
- pair despite rule 8.2 and 8.3. Example:
- SHR EAX,4 / INC EBX ; pair OK
-
- 8.5 An instruction which writes to the flags can pair with a conditional
- jump despite rule 8.2. Example:
- CMP EAX, 2 / JA LabelBigger ; pair OK
-
- 8.6 The following instruction combinations can pair despite the fact that
- they both modify the stack pointer:
- PUSH + PUSH, PUSH + CALL, POP + POP
-
- 8.7 There are restrictions on the pairing of instructions with prefix.
- There are several types of prefixes:
- - instructions addressing a non-default segment have a segment prefix.
- - instructions using 16 bit data in 32 bit mode, or 32 bit data in 16 bit
- mode have an operand size prefix.
- - instructions using 32 bit base or index registers in 16 bit mode have
- an address size prefix.
- - repeated string instructions have a repeat prefix.
- - locked instructions have a LOCK prefix.
- - many instructions which were not implemented on the 8086 processor
- have a two byte opcode where the first byte is 0FH. The 0FH byte
- behaves as a prefix on the Pentium without MMX, but not on the Pentium
- with MMX. The most common instructions with 0FH prefix are: MOVZX,
- MOVSX, PUSH FS, POP FS, PUSH GS, POP GS, LFS, LGS, LSS, SETcc, BT,
- BTC, BTR, BTS, BSF, BSR, SHLD, SHRD, and IMUL with two operands and no
- immediate operand.
-
- On the Pentium without MMX, a prefixed instruction can only execute in
- the U-pipe, except for conditional near jumps.
- On the Pentium with MMX, instructions with operand size, address
- size, or 0FH prefix can execute in either pipe, whereas instructions with
- segment, repeat, or lock prefix can only execute in the U-pipe.
-
- 8.8 An instruction which has both a displacement and immediate data is not
- pairable on the Pentium without MMX and only pairable in the U-pipe on
- Pentium with MMX:
- MOV DWORD PTR DS:[1000], 0 ; not pairable or only in U-pipe
- CMP BYTE PTR [EBX+8], 1 ; not pairable or only in U-pipe
- CMP BYTE PTR [EBX], 1 ; pairable
- CMP BYTE PTR [EBX+8], AL ; pairable
- (Another problem with instructions which have both a displacement and
- immediate data on the Pentium with MMX is that such instructions may
- be longer than 7 bytes, which means that only one instruction can be
- decoded per clock cycle, as explained in section 13.)
-
- 8.9 Both instructions must be preloaded and decoded. This is explained in
- section 9.
-
- 8.10 There are special pairing rules for MMX instructions:
- - MMX shift instructions can execute in either pipe but cannot pair with
- other MMX shift instructions
- - MMX multiply instructions can execute in either pipe but cannot pair
- with other MMX multiply instructions. They take 3 clock cycles and
- can be pipelined to a throughput of one multiplication per clock cycle
- in the same way as floating point instructions (see chapter 15).
- - an MMX instruction which accesses memory or integer registers can
- execute only in the U-pipe and cannot pair with a non-MMX instruction.
- For further details, see the Intel manuals.
-
-
-
- 9. FIRST TIME VERSUS REPEATED EXECUTION
- =======================================
- A piece of code usually takes much more time the first time it is executed
- than when it is repeated. The reasons are the following:
-
- 9.1 Loading the code from RAM into the cache takes longer than
- executing it.
-
- 9.2 In some cases decoding the code is the bottleneck. If it takes one clock
- cycle to determine the length of an instruction, then it is not possible
- to decode two instructions per clock cycle, because the processor
- doesn't know where the second instruction begins. The Pentium solves
- this problem by remembering the length of any instruction which has
- remained in the cache since last time it was executed. As a consequence
- of this, a set of instructions will not pair the first time they are
- executed, unless the first of the two instructions is only one byte
- long.
-
- 9.3 Jump instructions will not be in the branch target buffer the first
- time they execute, and therefore are less likely to be predicted
- correctly. See chapter 12.
-
- 9.4 Any data accessed by the code has to be loaded into the cache, which may
- take much more time than executing the instructions. When the code is
- repeated, then the data are more likely to be in the cache.
-
- For these four reasons, a piece of code inside a loop will generally take
- much more time the first time it executes than the subsequent times.
-
- If you have a big loop which doesn't fit into the code cache then you will
- get the penalty of 9.1 and 9.2 all the time because it doesn't run from the
- cache. You should therefore try to reorganize the loop to make it fit into
- the cache.
-
- If you have many jumps and branches inside a loop, then you may get the
- penalty in 9.3 all the time.
-
- Likewise, if a loop repeatedly accesses a data structure too big for
- the data cache, then you will get the penalty of data cache misses all the
- time.
-
-
- 10. IMPERFECT PAIRING
- =====================
- There are situations where the two instructions in a pair will not execute
- simultaneously, or only partially overlap in time. They should still be
- considered a pair, though, because the first instruction executes in the
- U-pipe, and the second in the V-pipe. No subsequent instruction can start
- to execute before both instructions in the imperfect pair have finished.
-
- Imperfect pairing will happen in the following cases:
-
- 10.1 If the second instruction suffers an AGI stall (see chapter 7).
-
- 10.2 Two instructions cannot access the same dword of memory simultaneously.
- The following examples assume that ESI is divisible by 4:
- MOV AL, [ESI] / MOV BL, [ESI+1]
- The two operands are within the same dword, so they cannot execute
- simultaneously. The pair takes 2 clock cycles.
- MOV AL, [ESI+3] / MOV BL, [ESI+4]
- Here the two operands are on each side of a dword boundary, so they
- pair perfectly, and take only one clock cycle.
-
- 10.3 Rule 10.2 is extended to the case where bits 2-4 are the same in the two
- addresses (cache bank conflict). For dword addresses this means that
- the difference between the two addresses should not be divisible by 32.
- Examples:
- MOV [ESI], EAX / MOV [ESI+32000], EBX ; imperfect pairing
- MOV [ESI], EAX / MOV [ESI+32004], EBX ; perfect pairing
-
- Pairable instructions which do not access memory take one clock cycle to
- execute, except for mispredicted jumps. MOV instructions to or from memory
- also take only one clock cycle if the data area is in the cache and properly
- aligned. There is no speed penalty for using complex addressing modes such
- as scaled index registers.
-
- A pairable instruction which reads from memory, does some calculation, and
- stores the result in a register or flags, takes 2 clock cycles.
- (read/modify instructions).
-
- A pairable instruction which reads from memory, does some calculation, and
- writes the result back to the memory, takes 3 clock cycles.
- (read/modify/write instructions).
-
- 10.4 If a read/modify/write instruction is paired with a read/modify or
- read/modify/write instruction, then they will pair imperfectly.
-
- The number of clock cycles used is given in the following table:
-
- | First instruction
- | MOV or read/ read/modify/
- Second instruction | register only modify write
- ----------------------|----------------------------------------------
- MOV or register only | 1 2 3
- read/modify | 2 2 4
- read/modify/write | 3 3 5
- ----------------------|-----------------------------------------------
-
- Example:
- ADD [mem1], EAX / ADD EBX, [mem2] ; 4 clock cycles
- ADD EBX, [mem2] / ADD [mem1], EAX ; 3 clock cycles
-
- 10.5 When two paired instructions both take extra time due to cache misses,
- misalignment, or jump misprediction, then the pair will take more time
- than each instruction, but less than the sum of the two.
-
- In order to avoid imperfect pairing you have to know which instructions go
- into the U-pipe, and which to the V-pipe. You can find this out by looking
- backwards in your code, searching for instructions which are unpairable,
- pairable only in one of the pipes, or cannot pair due to one of the rules
- in chapter 8.
-
- Imperfect pairing can often be avoided by reordering your instructions.
- Example:
-
- L1: MOV EAX,[ESI]
- MOV EBX,[ESI]
- INC ECX
-
- Here the two MOV instructions form an imperfect pair because they both
- access the same memory location, and the sequence takes 3 clock cycles.
- You can improve the code by reordering the instructions so that INC ECX
- pairs with one of the MOV instructions.
-
- L2: MOV EAX,OFFSET A
- XOR EBX,EBX
- INC EBX
- MOV ECX,[EAX]
- JMP L1
-
- The pair INC EBX / MOV ECX,[EAX] is imperfect because the latter
- instruction has an AGI stall. The sequence takes 4 clocks. If you insert
- a NOP or any other instruction so that MOV ECX,[EAX] pairs with JMP L1
- instead, then the sequence takes only 3 clocks.
-
- The next example is in 16 bit mode, assuming that SP is divisible by 4:
-
- L3: PUSH AX
- PUSH BX
- PUSH CX
- PUSH DX
- CALL FUNC
-
- Here the PUSH instructions form two imperfect pairs, because both operands
- in each pair go into the same dword of memory. PUSH BX could possibly pair
- perfectly with PUSH CX (because they go on each side of a dword boundary)
- but it doesn't because it has already been paired with PUSH AX. The sequence
- therefore takes 5 clocks. If you insert a NOP or any other instruction so
- that PUSH BX pairs with PUSH CX, and PUSH DX with CALL FUNC, then the
- sequence will take only 3 clocks. Another way to solve the problem is to
- make sure that SP is not divisible by 4. Knowing whether SP is divisible by
- 4 or not in 16 bit mode can be difficult, so the best way to avoid this
- problem is to use 32 bit mode.
-
-
- 11. SPLITTING COMPLEX INSTRUCTIONS INTO SIMPLER ONES
- ====================================================
- You may split up read/modify and read/modify/write instructions to improve
- pairing. Example:
- ADD [mem1],EAX / ADD [mem2],EBX ; 5 clock cycles
- This code may be split up into a sequence which takes only 3 clock cycles:
- MOV ECX,[mem1] / MOV EDX,[mem2] / ADD ECX,EAX / ADD EDX,EBX
- MOV [mem1],ECX / MOV [mem2],EDX
-
- Likewise you may split up non-pairable instructions into pairable instructions:
- PUSH [mem1] / PUSH [mem2] ; non-pairable
- Split up into:
- MOV EAX,[mem1] / MOV EBX,[mem2] / PUSH EAX / PUSH EBX ; everything pairs
-
- Other examples of non-pairable instructions which may be split up into simpler
- pairable instructions:
- CDQ split into: MOV EDX,EAX / SAR EDX,31
- NOT EAX change to XOR EAX,-1
- NEG EAX split into XOR EAX,-1 / INC EAX
- MOVZX EAX,BYTE PTR [mem] split into XOR EAX,EAX / MOV AL,BYTE PTR [mem]
- JECXZ split into TEST ECX,ECX / JZ
- LOOP split into DEC ECX / JNZ
- XLAT change to MOV AL,[EBX+EAX]
-
- If splitting instructions doesn't improve speed, then you may keep the
- complex or nonpairable instructions in order to reduce code size.
-
-
- 12. JUMPS AND BRANCHES
- ======================
- Note: This chapter describes in detail the branch prediction mechanism of
- the Pentium without MMX. This information is based on research done by me
- and Karki Jitendra Bahadur. Information found in Intel documents and
- elsewhere on this subject is directly misleading, and following the
- advice given in such documents is likely to lead to sub-optimal code.
-
- In the following, I will use the term 'jump instruction' for any
- instruction which can change the instruction pointer, including
- conditional and unconditional, direct and indirect, near and far, jumps,
- calls, and returns.
-
- The Pentium attempts to predict where a jump will go to, and whether a
- conditional jump will be taken or fall through. If the prediction is correct,
- then it can save time by loading the subsequent instructions into the
- pipeline before the jump is executed. If the prediction turns out to be
- wrong, then the pipeline has to be flushed, which will cost a penalty of 3
- to 5 clock cycles. (4 for a conditional jump in the V-pipe, 3 in all other
- cases for the Pentium without MMX. The MMX version has penalties of 5 and
- 4 clocks, respectively.)
-
- When optimizing code, it is important to minimize the number of misprediction
- penalties. This requires a good understanding of how the jump prediction
- works. I am giving a very detailed description of this topic here, because
- this information is not published anywhere else.
-
- Note that the following description applies only to the Pentium without
- MMX. Branch prediction in later processors will only be covered briefly.
-
-
- Branch prediction
- -----------------
- The Pentium uses a 'branch target buffer' (BTB), which can save information
- for up to 256 jump instructions. The BTB is organized like a 4-way set-
- associative cache with 64 entries per way. This means that the BTB can hold
- no more than 4 entries with the same set value. Unlike the data cache, the
- BTB uses a pseudo random replacement algorithm, which means that a new entry
- will not necessarily displace the least recently used entry of the same set-
- value. How the set-value is calculated will be explained later. Each BTB
- entry stores the address of the jump target and a prediction state, which
- can have four different values:
-
- state 0: "strongly not taken"
- state 1: "weakly not taken"
- state 2: "weakly taken"
- state 3: "strongly taken"
-
- A branch instruction is predicted to jump when in state 2 or 3, and to
- fall through when in state 0 or 1. Some documents describe the state
- transitions, not quite correctly, as if the state were incremented when the
- branch is taken and decremented when it falls through (no wrap-around).
- This would provide quite a good prediction, because a branch instruction
- would have to deviate twice from what it does most of the time, before the
- prediction changes.
-
- However, this mechanism has been compromised by the fact that state
- 0 also means 'unused BTB entry'. So a BTB entry in state 0 is the same as
- no BTB entry. This makes sense, because a branch instruction is predicted
- to fall through if it has no BTB entry. This improves the utilization of
- the BTB, because a branch instruction which is seldom taken will most of
- the time not take up any BTB entry.
-
- Now, if a taken jump has no BTB entry, then a new BTB entry will
- be generated, and this new entry will always be set to state 3. This means
- that it is impossible to go from state 0 to state 1 (except for a very
- special case discussed later). From state 0 you can only go to state 3, if
- the branch is taken. If the branch falls through, then it will stay out of
- the BTB.
-
- This is a serious design flaw. By always setting new entries to state
- 3, the designers apparently have given priority to minimizing the first
- time penalty for unconditional jumps and branches often taken, rather than
- improving the prediction of conditional jumps in tight innermost loops,
- which is almost the only place where speed really matters. The consequence
- of this flaw is that a branch instruction which falls through most of the
- time will have up to three times as many mispredictions as a branch
- instruction which is taken most of the time.
-
- You may take this asymmetry into account by organizing your branches so
- that they are taken more often than not.
-
- Consider for example this if-then-else construction:
-
- TEST EAX,EAX
- JZ A
- <branch 1>
- JMP E
- A: <branch 2>
- E:
-
- If branch 1 is executed more often than branch 2, and branch 2 is seldom
- executed twice in succession, then you can reduce the number of branch
- mispredictions by up to a factor of 3 by swapping the two branches so that
- the branch instruction will jump more often than fall through:
-
- TEST EAX,EAX
- JNZ A
- <branch 2>
- JMP E
- A: <branch 1>
- E:
-
- (This is _contrary_ to the recommendations in Intel's tutorial. They don't
- seem to recognize the asymmetry in branch prediction. But it is easy to
- prove that the latter example is superior).
-
- There may be reasons to put the most often executed branch first, however:
- 1. Putting seldom executed branches away in the bottom of your code can
- improve code cache utilization.
- 2. A branch instruction seldom taken will stay out of the BTB most of the
- time, possibly improving BTB utilization.
- 3. The branch instruction will be predicted as not taken if it has been
- flushed out of the BTB by other branch instructions.
- 4. The asymmetry in branch prediction only exists on the Pentium without MMX (?)
-
- These considerations have little weight, however, for small critical loops,
- so I would still recommend organizing branches with a skewed distribution
- so that the branch instruction is taken more often than not, unless branch
- 2 is executed so seldom that misprediction doesn't matter.
-
- Likewise, you should preferably organize loops with the testing branch at
- the bottom, as in this example:
-
- MOV ECX, [N]
- L: MOV [EDI],EAX
- ADD EDI,4
- DEC ECX
- JNZ L
-
- If N is high, then the JNZ instruction here will be taken more often than
- not, and never fall through twice in succession.
-
-
- BTB is looking ahead
- --------------------
- The BTB mechanism counts instruction pairs, rather than single
- instructions, so you have to know how instructions pair in order
- to analyze where a BTB entry is stored. The BTB entry for any jump
- instruction is attached to the address of the U-pipe instruction in the
- preceding instruction pair. (An unpaired instruction counts as one pair).
- Example:
-
- SHR EAX,1
- MOV EBX,[ESI]
- CMP EAX,EBX
- JB L
-
- Here SHR pairs with MOV, and CMP pairs with JB. The BTB entry for
- JB L is thus attached to the address of the SHR EAX,1 instruction. When
- this BTB entry is met, and if it is in state 2 or 3, then the Pentium will
- read the target address from the BTB entry, and load the instructions
- following L into the pipeline. This happens before the branch instruction
- has been decoded, so the Pentium relies solely on the information in the
- BTB when doing this.
-
- You may remember that instructions seldom pair the first time they are
- executed (see paragraph 9.2). If the instructions above are not
- pairing, then the BTB entry should be attached to the address of the CMP
- instruction, and this entry would be wrong on the next execution, when
- instructions are pairing. However, in most cases the Pentium is smart enough
- to not make a BTB entry when there is an unused pairing opportunity, so you
- don't get a BTB entry until the second execution, and hence you won't get a
- prediction until the third execution. (In the rare case, where every second
- instruction is a single-byte instruction, you may get a BTB entry on the
- first execution which becomes invalid in the second execution, but since
- the instruction it is attached to will then go to the V-pipe, it is ignored
- and gives no penalty. A BTB entry is only read if it is attached to the
- address of a U-pipe instruction).
-
- A BTB entry is identified by its set-value which is equal to bits 0-5
- of the address it is attached to. Bits 6-31 are then stored in the BTB as
- a tag. Addresses which are spaced a multiple of 64 bytes apart will have
- the same set-value. You can have no more than four BTB entries with the
- same set-value. If you want to check whether your jump instructions contend
- for the same BTB entries, then you have to compare bits 0-5 of the addresses
- of the U-pipe instructions in the preceding instruction pairs. This is very
- tedious, and I have never heard of anybody doing so. There are no tools
- available to do this job for you, as far as I know.
-
-
- Consecutive branches
- --------------------
- When a jump is mispredicted, then the pipeline gets flushed. If the next
- instruction pair executed also contains a jump instruction, then the Pentium
- won't load its target because it cannot load a new target while the
- pipeline is being flushed. The result is that the second jump instruction
- is predicted to fall through regardless of the state of its BTB entry.
- Therefore, if the second jump is also taken, then you will get another
- penalty. The state of the BTB entry for the second jump instruction does get
- correctly updated, though. If you have a long chain of taken jump
- instructions, and the first jump in the chain is mispredicted, then the
- pipeline will get flushed all the time, and you will get nothing but
- mispredictions until you meet an instruction pair that does not jump. The
- most extreme case of this is a loop which jumps to itself: It will get a
- misprediction penalty for each iteration.
-
- This is not the only problem with consecutive jumps. Another problem is that
- you can have a branch instruction between a BTB entry and the jump
- instruction it belongs to. If the first branch instruction jumps to
- somewhere else, then strange things may happen. Consider this example:
-
- SHR EAX,1
- MOV EBX,[ESI]
- CMP EAX,EBX
- JB L1
- JMP L2
-
- L1: MOV EAX,EBX
- INC EBX
-
- When JB L1 falls through, then you will get a BTB entry for JMP L2
- attached to the address of CMP EAX,EBX. But what will happen when
- JB L1 later is taken? At the time when the BTB entry for JMP L2 is
- read, the processor doesn't know that the next instruction pair does not
- contain a jump instruction, so it will actually predict the instruction
- pair MOV EAX,EBX / INC EBX to jump to L2. The penalty for predicting
- non-jump instructions to jump is 3 clock cycles. The BTB entry for JMP L2
- will get its state decremented, because it is applied to something which
- doesn't jump. If we keep going to L1, then the BTB entry for JMP L2
- will be decremented to state 1 and 0, so that the problem will disappear
- until next time JMP L2 is executed.
-
- The penalty for predicting the non-jumping instructions to jump only occurs
- when the jump to L1 is predicted. If JB L1 jumps but was predicted not to,
- then the pipeline gets flushed and we won't get the false L2 target
- loaded, so in this case we will not see the penalty of predicting the
- non-jumping instructions to jump, but we do get the BTB entry for JMP L2
- decremented.
-
- Suppose, now, that we replace the INC EBX instruction above with another
- jump instruction. This third jump instruction will then use the same
- BTB entry as JMP L2 with the possible penalty of predicting a wrong target,
- (unless it happens to also have L2 as target).
-
- To summarize, consecutive jumps can lead to the following problems:
- - failure to load a jump target when the pipeline is being flushed by a
- preceding mispredicted jump.
- - a BTB entry being mis-applied to non-jumping instructions and predicting
- them to jump.
- - a second consequence of the above is that a mis-applied BTB entry will
- get its state decremented, possibly leading to a later misprediction of
- the jump it belongs to. Even unconditional jumps can be predicted to fall
- through for this reason.
- - two jump instructions may share the same BTB entry, leading to the
- prediction of a wrong target.
-
- All this mess may give you a lot of penalties, so you should definitely
- avoid having an instruction pair containing a jump immediately after another
- poorly predictable jump.
-
- It is time for an illustrative example:
-
- CALL P
- TEST EAX,EAX
- JZ L2
- L1: MOV [EDI],EBX
- ADD EDI,4
- DEC EAX
- JNZ L1
- L2: CALL P
-
- This looks like a quite nice and normal piece of code: A function call,
- a loop which is bypassed when the count is zero, and another function call.
- How many problems can you spot in this program?
-
- First, we may note that the function P is called alternately from two
- different locations. This means that the target for the return from P will
- be changing all the time. Consequently, the return from P will always be
- mispredicted.
-
- Assume, now, that EAX is zero. The jump to L2 will not have its target
- loaded because the mispredicted return caused a pipeline flush. Next, the
- second CALL P will also fail to have its target loaded because JZ L2
- caused a pipeline flush. Here we have the situation where a chain of
- consecutive jumps makes the pipeline flush repeatedly because the first
- jump was mispredicted. The BTB entry for JZ L2 is stored at the address
- of P's return instruction. This BTB entry will now be mis-applied to
- whatever comes after the second CALL P, but that doesn't give a penalty
- because the pipeline is flushed by the mispredicted second return.
-
- Now, let's see what happens if EAX has a nonzero value the next time:
- JZ L2 is always predicted to fall through because of the flush. The second
- CALL P has a BTB entry at the address of TEST EAX,EAX. This entry will
- be mis-applied to the MOV/ADD pair, predicting it to jump to P. This
- causes a flush which prevents JNZ L1 from loading its target. If we have
- been here before, then the second CALL P will have another BTB entry at
- the address of DEC EAX. On the second and third iteration of the loop, this
- entry will also be mis-applied to the MOV/ADD pair, until it has had its
- state decremented to 1 or 0. This will not cause a penalty on the second
- iteration because the flush from JNZ L1 prevents it from loading its false
- target, but on the third iteration it will. The subsequent iterations of the
- loop have no penalties, but when it exits, JNZ L1 is mispredicted. The
- flush would now prevent CALL P from loading its target, were it not for
- the fact that the BTB entry for CALL P has already been destroyed by
- being mis-applied several times.
-
- We can improve this code by putting in some NOP's to separate all consecutive
- jumps:
-
- CALL P
- TEST EAX,EAX
- NOP
- JZ L2
- L1: MOV [EDI],EBX
- ADD EDI,4
- DEC EAX
- JNZ L1
- L2: NOP
- NOP
- CALL P
-
- The extra NOP's cost 2 clock cycles, but they save much more. Furthermore,
- JZ L2 is now moved to the U-pipe which reduces its penalty from 4 to 3
- when mispredicted. The only problem that remains is that the returns from
- P are always mispredicted. This problem can only be solved by replacing the
- call to P by an inline macro (if you have enough code cache).
-
- The lesson to learn from this example is that you should always look
- carefully for consecutive jumps and see if you can save time by inserting
- some NOP's. You should be particularly aware of those situations where
- misprediction is unavoidable, such as loop exits and returns from procedures
- which are called from varying locations. If you have something useful to
- put in instead of the NOP's, then you should of course do so.
-
- Multiway branches (case statements) may be implemented either as a tree of
- branch instructions or as a list of jump addresses. If you choose to use
- a tree of branch instructions, then you have to include some NOP's or other
- instructions to separate the consecutive branches. A list of jump addresses
- may therefore be a better solution on the Pentium. The list of jump
- addresses should be placed in the data segment. Never put data in the code
- segment! On the PentiumPro you may choose the branch tree solution if it
- gives a better prediction.
-
-
- Multiple BTB entries
- --------------------
- While two jump instructions can share the same BTB entry, one jump
- instruction can also have multiple BTB entries if it is jumped to from
- different locations, or if its pairing is changing. Example:
-
- JZ L1
- MOV EAX,1
- L1: MOV EBX,2
- MOV ECX,3
- JC L2
-
- If JZ L1 falls through, then MOV EAX,1 pairs with MOV EBX,2, and
- MOV ECX,3 pairs with JC L2. The BTB entry for JC L2 will be attached
- to the address of MOV EAX,1. When JZ L1 is taken, then MOV EBX,2 will
- pair with MOV ECX,3, JC L2 will be unpaired, and its BTB entry will be
- at MOV EBX,2. So you have two BTB entries for JC L2. The first entry
- is used when the zero flag is clear, the other when set. This may actually
- be an advantage, because it will improve the predictability of the second
- branch instruction if there is a correlation between the two. Assume, for
- example, that this code sequence is preceded by a CMP instruction. Then
- we can be certain that the zero flag and the carry flag will never both be
- set at the same time. We will get no second BTB entry for JC L2 at
- MOV EBX,2 because it always falls through if the zero flag is set, and
- we will never get a misprediction for JC L2 when JZ L1 is taken.
- The situations where you can take advantage of this trick are rare,
- however, because you might as well let L1 go to another branch which
- doesn't test the carry flag at all.
-
-
- Tight loops
- -----------
- In a small loop you will often access the same BTB entry repeatedly with
- small intervals. This never causes a stall. Rather than waiting for a BTB
- entry to be updated, the Pentium somehow bypasses the pipeline and gets the
- resulting state from the last jump before it has been written to the BTB.
- This mechanism is almost transparent to the user, but it does in some cases
- have funny effects: You can see a branch prediction going from state 0 to
- state 1, rather than to state 3, if the zero has not yet been written to the
- BTB. This happens if the loop has no more than four instruction pairs. In
- loops with only two instruction pairs you may sometimes have state 0 for two
- consecutive iterations without going out of the BTB. In such small loops it
- also happens in rare cases that the prediction uses the state resulting from
- two iterations ago, rather than from the last iteration. These funny effects
- will usually not have any negative effects on performance.
-
-
- Avoiding branches
- -----------------
- Sometimes it is possible to obtain the same effect as a branch by ingenious
- manipulation of bits and flags. For example you may calculate the absolute
- value of a signed number without branching:
- MOV EDX,EAX
- SAR EDX,31
- XOR EAX,EDX
- SUB EAX,EDX
-
- The carry flag is particularly useful for this kind of trick.
- Setting carry if a value is zero: CMP [value],1
- Setting carry if a value is not zero: XOR EAX,EAX / CMP EAX,[value]
- Incrementing a counter if carry: ADC EAX,0
- Setting a bit for each time the carry is set: RCL EAX,1
- Generating a bit mask if carry is set: SBB EAX,EAX
-
- This example finds the minimum of two unsigned numbers: if (b < a) a = b;
- SUB EBX,EAX
- SBB ECX,ECX
- AND ECX,EBX
- ADD EAX,ECX
-
- This example chooses between two numbers: if (a != 0) a = b; else a = c;
- CMP EAX,1
- SBB EAX,EAX
- XOR ECX,EBX
- AND EAX,ECX
- XOR EAX,EBX
-
- Whether or not such tricks are worth the extra code depends on how
- predictable a conditional jump would be, whether the extra pairing
- opportunities of the branch-free code can be utilized, and whether there
- are other jumps following immediately after which could suffer the penalties
- of consecutive jumps.
-
-
- Later processors
- ----------------
- The PentiumPro and MMX processors have a more advanced branch prediction
- mechanism which enables them to correctly predict a simple repetitive
- pattern. They do not have the asymmetry flaw that the Pentium without MMX
- has. A conditional jump, which has not been seen before, is predicted
- taken if it goes backwards (e.g. a loop), and not taken if it goes
- forward, on these new processors.
-
- MMX processors have a return stack buffer (RSB) which remembers where
- calls come from, so you will no longer get mispredictions when calling
- a subroutine from varying locations. To make this mechanism work, you
- should always match calls with returns. This means that you should not
- use the return instruction as an indirect jump or use an indirect jump
- as return.
-
- Mispredictions are very expensive on the PentiumPro (10-16 clocks!). You
- should therefore avoid poorly predictable branches, and replace them with
- conditional moves whenever possible.
-
- The new processors also have problems with consecutive jumps, but for
- different reasons. So avoiding consecutive jumps may be desirable on all
- processors.
-
-
- 13. PREFIXES
- ============
- An instruction with one or more prefixes may not be able to execute in the
- V-pipe (see paragraph 8.7), and it may take more than one clock cycle to
- decode. On the Pentium without MMX, the decoding delay is one clock cycle
- for each prefix except for the 0Fh prefix of conditional near jumps.
-
- The Pentium with MMX has no decoding delay for the 0Fh prefix. Segment and repeat
- prefixes take one clock extra to decode. Address and operand size prefixes
- take two clocks extra to decode (one for reading the prefix, and one for
- calculating the length of the instruction).
-
- Address size prefixes can be avoided by using 32 bit mode.
- Segment prefixes can be avoided in 32 bit mode by using a flat memory model.
- Operand size prefixes can be avoided in 32 bit mode by using only 8 bit and
- 32 bit integers.
-
- Where prefixes are unavoidable, the decoding delay may be masked if a
- preceding instruction takes more than one clock cycle to execute. The rule
- for the Pentium without MMX is that any instruction which takes N clock
- cycles to execute (not to decode) can 'overshadow' the decoding delay of N-1
- prefixes in the next two (sometimes three) instructions or instruction
- pairs. In other words, each extra clock cycle that an instruction takes to
- execute can be used to decode one prefix in a later instruction. This
- shadowing effect even extends across a predicted branch. Any instruction
- which takes more than one clock cycle to execute, and any instruction which
- is delayed because of an AGI stall, cache miss, misalignment, or any other
- reason except decoding delay and branch misprediction, has a shadowing
- effect.
-
- The Pentium with MMX has a similar shadowing effect, but the mechanism is
- different. Decoded instructions are stored in a transparent
- first-in-first-out (FIFO) buffer, which can hold up to four instructions.
- As long as there are instructions in the FIFO buffer you get no delay.
- When the buffer is empty then instructions are executed as soon as they
- are decoded. The buffer is filled when instructions are decoded faster
- than they are executed, i.e. when you have unpaired or multi-cycle
- instructions. The FIFO buffer is emptied when instructions execute faster
- than they are decoded, i.e. when you have decoding delays due to prefixes.
- The FIFO buffer is empty after a mispredicted branch. The FIFO buffer can
- receive two instructions per clock cycle provided that the second
- instruction is without prefixes and none of the instructions are longer
- than 7 bytes. The two execution pipelines (U and V) can each receive one
- instruction per clock cycle from the FIFO buffer.
-
-
- Examples:
-
- CLD / REP MOVSD
- The CLD instruction takes two clock cycles and can therefore overshadow
- the decoding delay of the REP prefix. The code would take one clock cycle
- more if the CLD instruction was placed far from the REP MOVSD.
-
- CMP DWORD PTR [EBX],0 / MOV EAX,0 / SETNZ AL
- The CMP instruction takes two clock cycles here because it is a read/modify
- instruction. The 0FH prefix of the SETNZ instruction is decoded during the
- second clock cycle of the CMP instruction, so that the decoding delay is
- hidden.
-
-
- 14. REDUCING CODE SIZE
- ======================
- As explained in paragraph 6, the code cache is 8 kb. If you have problems
- keeping the critical parts of your code within the code cache, then you may
- consider reducing the size of your code.
-
- 32 bit code is usually bigger than 16 bit code because addresses and data
- constants take 4 bytes in 32 bit code and only 2 bytes in 16 bit code.
- However, 16 bit code has other penalties such as prefixes and problems with
- accessing adjacent words simultaneously (see chapter 10). Some other
- methods for reducing the size of your code are discussed below.
-
- Jump addresses, data addresses, and data constants all take less space if
- they can be expressed as a sign-extended byte, i.e. if they are within the
- interval from -128 to +127.
-
- For jump addresses this means that short jumps take two bytes of code,
- whereas jumps beyond 127 bytes take 5 bytes if unconditional and 6 bytes if
- conditional.
-
- Likewise, data addresses take less space if they can be expressed as a
- pointer and a displacement between -128 and +127.
- Example:
- MOV EBX,DS:[100000] / ADD EBX,DS:[100004] ; 12 bytes
- Reduce to:
- MOV EAX,100000 / MOV EBX,[EAX] / ADD EBX,[EAX+4] ; 10 bytes
-
- The advantage of using a pointer obviously increases if you use it many
- times. Storing data on the stack and using EBP or ESP as pointer will thus
- make your code smaller than if you use static memory locations and absolute
- addresses, provided of course that your data are within 127 bytes of the
- pointer. Using PUSH and POP to write and read temporary data is even more
- advantageous.
-
- Data constants may also take less space if they are between -128 and +127.
- Most instructions with immediate operands have a short form where the
- operand is a sign-extended single byte. Examples:
- PUSH 200 ; 5 bytes
- PUSH 100 ; 2 bytes
-
- ADD EBX,128 ; 6 bytes
- SUB EBX,-128 ; 3 bytes
-
- The most important instruction with an immediate operand which doesn't have
- such a short form is MOV. Examples:
- MOV EAX, 1 ; 5 bytes
- XOR EAX,EAX / INC EAX ; 3 bytes
- PUSH 1 / POP EAX ; 3 bytes
-
- If the same constant is used more than once then you may of course load it
- into a register. Example:
- MOV DWORD PTR [EBX],0 / MOV DWORD PTR [EBX+4],0 ; 13 bytes
- XOR EAX,EAX / MOV [EBX],EAX / MOV [EBX+4],EAX ; 7 bytes
-
- You may also consider that different instructions have different lengths.
- The following instructions take only one byte and are therefore very
- attractive: PUSH reg, POP reg, INC reg32, DEC reg32.
- INC and DEC with 8 bit registers take 2 bytes, so INC EAX is shorter than
- INC AL.
-
- XCHG EAX,reg is also a single-byte instruction and thus takes less space
- than MOV EAX,reg, but it is slower and not pairable.
-
- Some instructions take one byte less when they use the accumulator than when
- they use any other register. Examples:
-
- MOV EAX,DS:[100000] is smaller than MOV EBX,DS:[100000]
- ADD EAX,1000 is smaller than ADD EBX,1000
-
- Instructions with pointers take one byte less when they have only a base
- pointer (not ESP) and a displacement than when they have a scaled index
- register, or both base pointer and index register, or ESP as base pointer.
- Examples:
- MOV EAX,[array][EBX] is smaller than MOV EAX,[array][EBX*4]
- MOV EAX,[EBP+12] is smaller than MOV EAX,[ESP+12]
- Instructions with EBP as base pointer and no displacement and no index take
- one byte more than with other registers:
- MOV EAX,[EBX] is smaller than MOV EAX,[EBP], but
- MOV EAX,[EBX+4] is same size as MOV EAX,[EBP+4].
-
-
- 15. SCHEDULING FLOATING POINT CODE
- ==================================
- Floating point instructions cannot pair the way integer instructions can,
- except for one special case, defined by the following rules:
-
- - the first instruction (executing in the U-pipe) must be FLD, FADD, FSUB,
- FMUL, FDIV, FCOM, FCHS, or FABS
-
- - the second instruction (in V-pipe) must be FXCH
-
- - the instruction following the FXCH must be a floating point instruction,
- otherwise the FXCH will take an extra clock cycle.
-
- This special pairing is important, as will be explained shortly.
-
- While floating point instructions in general cannot be paired, many can be
- pipelined, i.e. one instruction can begin before the previous instruction has
- finished. Example:
-
- FADD ST(1),ST(0) ; clock cycle 1-3
- FADD ST(2),ST(0) ; clock cycle 2-4
- FADD ST(3),ST(0) ; clock cycle 3-5
- FADD ST(4),ST(0) ; clock cycle 4-6
-
- Obviously, two instructions cannot overlap if the second instruction needs
- the result of the first. Since almost all floating point instructions
- involve the top of stack register, ST(0), there are seemingly not very many
- possibilities for making an instruction independent of the result of
- previous instructions. The solution to this problem is register renaming.
- The FXCH instruction does not in reality swap the contents of two registers,
- it only swaps their names. Instructions which push or pop the register
- stack also work by renaming. Register renaming has been highly optimized on
- the Pentium so that a register may be renamed while in use. Register
- renaming never causes stalls - it is even possible to rename a register more
- than once in the same clock cycle, as for example when you pair FLD or
- FCOMPP with FXCH.
-
- By the proper use of FXCH instructions you may obtain a lot of overlapping in
- your floating point code. Example:
-
- FLD [a1] ; clock cycle 1
- FADD [a2] ; clock cycle 2-4
- FLD [b1] ; clock cycle 3
- FADD [b2] ; clock cycle 4-6
- FLD [c1] ; clock cycle 5
- FADD [c2] ; clock cycle 6-8
- FXCH ST(2) ; clock cycle 6
- FADD [a3] ; clock cycle 7-9
- FXCH ST(1) ; clock cycle 7
- FADD [b3] ; clock cycle 8-10
- FXCH ST(2) ; clock cycle 8
- FADD [c3] ; clock cycle 9-11
- FXCH ST(1) ; clock cycle 9
- FADD [a4] ; clock cycle 10-12
- FXCH ST(2) ; clock cycle 10
- FADD [b4] ; clock cycle 11-13
- FXCH ST(1) ; clock cycle 11
- FADD [c4] ; clock cycle 12-14
- FXCH ST(2) ; clock cycle 12
-
- In the above example we are interleaving three independent threads. Each
- FADD takes 3 clock cycles, and we can start a new FADD in each clock cycle.
- When we have started an FADD in the 'a' thread we have time to start two new
- FADD instructions in the 'b' and 'c' threads before returning to the 'a'
- thread, so every third FADD belongs to the same thread. We are using FXCH
- instructions every time to get the register that belongs to the desired
- thread into ST(0). As you can see in the example above, this generates a
- regular pattern, but note well that the FXCH instructions repeat with a
- period of two while the threads have a period of three. This can be quite
- confusing, so you have to 'play computer' in order to know which registers
- are where.
-
- All versions of the instructions FADD, FSUB, FMUL, and FILD take 3 clock
- cycles and are able to overlap, so that these instructions may be scheduled
- using the method described above. Using a memory operand does not take more
- time than a register operand if the memory operand is in the level 1 cache
- and properly aligned.
-
- By now you must be used to the rules having exceptions, and the overlapping
- rule is no exception: You cannot start an FMUL instruction one clock cycle
- after another FMUL instruction, because the FMUL circuitry is not perfectly
- pipelined. It is recommended that you put another instruction in between two
- FMUL's. Example:
-
- FLD [a1] ; clock cycle 1
- FLD [b1] ; clock cycle 2
- FLD [c1] ; clock cycle 3
- FXCH ST(2) ; clock cycle 3
- FMUL [a2] ; clock cycle 4-6
- FXCH ; clock cycle 4
- FMUL [b2] ; clock cycle 5-7 (stall)
- FXCH ST(2) ; clock cycle 5
- FMUL [c2] ; clock cycle 7-9 (stall)
- FXCH ; clock cycle 7
- FSTP [a3] ; clock cycle 8-9
- FXCH ; clock cycle 10 (unpaired)
- FSTP [b3] ; clock cycle 11-12
- FSTP [c3] ; clock cycle 13-14
-
- Here you have a stall before FMUL [b2] and before FMUL [c2] because
- another FMUL started in the preceding clock cycle. You can improve this code
- by putting FLD instructions in between the FMUL's:
-
- FLD [a1] ; clock cycle 1
- FMUL [a2] ; clock cycle 2-4
- FLD [b1] ; clock cycle 3
- FMUL [b2] ; clock cycle 4-6
- FLD [c1] ; clock cycle 5
- FMUL [c2] ; clock cycle 6-8
- FXCH ST(2) ; clock cycle 6
- FSTP [a3] ; clock cycle 7-8
- FSTP [b3] ; clock cycle 9-10
- FSTP [c3] ; clock cycle 11-12
-
- In other cases you may put FADD, FSUB, or anything else in between FMUL's to
- avoid the stalls.
-
- Overlapping floating point instructions requires of course that you have
- some independent threads that you can interleave. If you have only one big
- formula to execute, then you may compute parts of the formula in parallel to
- achieve overlapping. If, for example, you want to add six numbers, then you
- may split the operations into two threads with three numbers in each, and
- add the two threads in the end:
-
- FLD [a] ; clock cycle 1
- FADD [b] ; clock cycle 2-4
- FLD [c] ; clock cycle 3
- FADD [d] ; clock cycle 4-6
- FXCH ; clock cycle 4
- FADD [e] ; clock cycle 5-7
- FXCH ; clock cycle 5
- FADD [f] ; clock cycle 7-9 (stall)
- FADD ; clock cycle 10-12 (stall)
-
- Here we have a one clock stall before FADD [f] because it is waiting for
- the result of FADD [d] and a two clock stall before the last FADD because
- it is waiting for the result of FADD [f]. The latter stall can be hidden by
- filling in some integer instructions, but the first stall cannot, because an
- integer instruction at this place would make the FXCH unpairable.
-
- The first stall can be avoided by having three threads rather than two, but
- that would cost an extra FLD so we do not save anything by having three
- threads rather than two unless there are at least eight numbers to add.
-
- Not all floating point instructions can overlap. And some floating point
- instructions can overlap more subsequent integer instructions than
- subsequent floating point instructions. The FDIV instruction, for example,
- takes 39 clock cycles. All but the first clock cycle can overlap with
- integer instructions, but only the last two clock cycles can overlap with
- floating point instructions. Example:
-
- FDIV ; clock cycle 1-39
- FXCH ; clock cycle 1-2
- CMC ; clock cycle 3-4
- RCR EAX,1 ; clock cycle 5
- INC EBX ; clock cycle 5
- FADD [x] ; clock cycle 38-40
- FXCH ; clock cycle 38
- FMUL [y] ; clock cycle 40-42
-
- The first FXCH pairs with the FDIV, but takes an extra clock cycle because
- it is not followed by a floating point instruction. The CMC starts before
- the FDIV is finished, but has to wait for the FXCH to finish. The RCR and
- INC instructions are pairing. The FADD has to wait till clock 38 because new
- floating point instructions can only execute during the last two clock
- cycles of the FDIV. The second FXCH pairs with the FADD. The FMUL has to
- wait for the FDIV to finish because it uses the result of the division.
-
- If you have nothing else to put in after a floating point instruction with a
- large integer overlap, such as FDIV or FSQRT, then you may put in a dummy
- read from an address which you expect to need later in the program to make
- sure it is in the level one cache. Example:
- FDIV QWORD PTR [EBX]
- CMP [ESI],EAX
- FMUL QWORD PTR [ESI]
- Here we use the integer overlap to pre-load the value at [ESI] into the
- cache while the FDIV is being computed (we don't care what the result of the
- CMP is).
-
- Paragraph 21 gives a complete listing of floating point instructions, and
- what they can pair or overlap with.
-
- One floating point instruction deserves special mention, namely FST or
- FSTP with a memory operand. This instruction takes two clock cycles in the
- execution stage, but it seems to start converting the value in ST(0) already
- at the address decode stage in the pipeline, so that there is a one clock
- stall if the value to store is not ready one clock cycle in advance. This is
- analogous to an AGI stall. Example:
-
- FLD [a1] ; clock cycle 1
- FADD [a2] ; clock cycle 2-4
- FLD [b1] ; clock cycle 3
- FADD [b2] ; clock cycle 4-6
- FXCH ; clock cycle 4
- FSTP [a3] ; clock cycle 6-7
- FSTP [b3] ; clock cycle 8-9
-
- The FSTP [a3] stalls for one clock cycle because the result of FADD [a2]
- is not ready in the preceding clock cycle. In many cases you cannot hide
- this type of stall without scheduling your floating point code into four
- threads or putting some integer instructions in between. No other
- instructions have this weirdness. The two clock cycles in the execution
- stage of the FST(P) instruction cannot pair or overlap with any subsequent
- instructions.
-
- Instructions with integer operands such as FIADD, FISUB, FIMUL, FIDIV, FICOM
- may be split up into simpler operations in order to improve overlapping.
- Example:
-
- FILD [a] ; clock cycle 1-3
- FIMUL [b] ; clock cycle 4-9
-
- Split up into:
-
- FILD [a] ; clock cycle 1-3
- FILD [b] ; clock cycle 2-4
- FMUL ; clock cycle 5-7
-
- In this example, you save two clocks by overlapping the two FILD instructions.
-
-
- 16. LOOPS
- =========
- When analyzing a program you often find that 99% of the time consumption
- lies in the innermost loop. The way to improve the speed is to carefully
- optimize the most time-consuming loop using ASM language. The rest of the
- program may be left in high-level language.
-
- A loop generally contains a counter controlling how many times to iterate,
- and often array access reading or writing one array element for each
- iteration. I have chosen as example a procedure which reads integers from an
- array, changes the sign of each integer, and stores the results in another
- array.
-
- In C, the procedure would look like this:
-
- void ChangeSign (int * A, int * B, int N) {
- int i;
- for (i=0; i<N; i++) B[i] = -A[i];}
-
- Translating to assembler, we might write the procedure like this:
-
- Example 1:
-
- _ChangeSign PROC NEAR
- PUSH ESI
- PUSH EDI
- A EQU DWORD PTR [ESP+12]
- B EQU DWORD PTR [ESP+16]
- N EQU DWORD PTR [ESP+20]
-
- MOV ECX, [N]
- JECXZ L2
- MOV ESI, [A]
- MOV EDI, [B]
- CLD
- L1: LODSD
- NEG EAX
- STOSD
- LOOP L1
- L2: POP EDI
- POP ESI
- RET ; (no extra pop if C calling convention)
- _ChangeSign ENDP
-
- This looks like a nice solution, but it is not optimal because it uses slow
- non-pairable instructions. It takes 11 clock cycles per iteration if all
- data are in the level one cache.
-
- Using pairable instructions only
- --------------------------------
-
- Example 2:
-
- MOV ECX, [N]
- MOV ESI, [A]
- TEST ECX, ECX
- JZ SHORT L2
- MOV EDI, [B]
- L1: MOV EAX, [ESI] ; u
- XOR EBX, EBX ; v (pairs)
- ADD ESI, 4 ; u
- SUB EBX, EAX ; v (pairs)
- MOV [EDI], EBX ; u
- ADD EDI, 4 ; v (pairs)
- DEC ECX ; u
- JNZ L1 ; v (pairs)
- L2:
-
- Here we have used pairable instructions only, and scheduled the
- instructions so that everything pairs. It now takes only 4 clock cycles per
- iteration. We could have obtained the same speed without splitting the NEG
- instruction, but the other unpairable instructions should be split up.
-
- Using the same register for counter and index
- ---------------------------------------------
-
- Example 3:
-
- MOV ESI, [A]
- MOV EDI, [B]
- MOV ECX, [N]
- XOR EDX, EDX
- TEST ECX, ECX
- JZ SHORT L2
- L1: MOV EAX, [ESI+4*EDX] ; u
- NEG EAX ; u
- MOV [EDI+4*EDX], EAX ; u
- INC EDX ; v (pairs)
- CMP EDX, ECX ; u
- JB L1 ; v (pairs)
- L2:
-
- Using the same register for counter and index gives us fewer instructions in
- the body of the loop, but it still takes 4 clocks because we have two
- unpaired instructions.
-
- Letting the counter end at zero
- -------------------------------
- We want to get rid of the CMP instruction in example 3 by letting the
- counter end at zero and use the zero flag for detecting when we are finished
- as we did in example 2. One way to do this would be to execute the loop
- backwards taking the last array elements first. However, data caches are
- optimized for accessing data forwards, not backwards, so if cache misses are
- likely, then you should rather start the counter at -N and count through
- negative values up to zero. The base registers should then point to the end
- of the arrays rather than the beginning:
-
- Example 4:
-
- MOV ESI, [A]
- MOV EAX, [N]
- MOV EDI, [B]
- XOR ECX, ECX
- LEA ESI, [ESI+4*EAX] ; point to end of array A
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+4*EAX] ; point to end of array B
- JZ SHORT L2
- L1: MOV EAX, [ESI+4*ECX] ; u
- NEG EAX ; u
- MOV [EDI+4*ECX], EAX ; u
- INC ECX ; v (pairs)
- JNZ L1 ; u
- L2:
-
- We are now down at five instructions in the loop body but it still takes 4
- clocks because of poor pairing. (If the addresses and sizes of the arrays
- are constants we may save two registers by substituting A+SIZE A for ESI and
- B+SIZE B for EDI). Now let's see how we can improve pairing.
-
- Pairing calculations with loop overhead
- ---------------------------------------
- We may want to improve pairing by intermingling calculations with the loop
- control instructions. If we want to put something in between INC ECX and
- JNZ L1, it has to be something that doesn't affect the zero flag. The MOV
- [EDI+4*ECX],EBX instruction after INC ECX would generate an AGI delay,
- so we have to be more ingenious:
-
- Example 5:
-
- MOV EAX, [N]
- XOR ECX, ECX
- SHL EAX, 2 ; 4 * N
- JZ SHORT L3
- MOV ESI, [A]
- MOV EDI, [B]
- SUB ECX, EAX ; - 4 * N
- ADD ESI, EAX ; point to end of array A
- ADD EDI, EAX ; point to end of array B
- JMP SHORT L2
- L1: MOV [EDI+ECX-4], EAX ; u
- L2: MOV EAX, [ESI+ECX] ; v (pairs)
- XOR EAX, -1 ; u
- ADD ECX, 4 ; v (pairs)
- INC EAX ; u
- JNC L1 ; v (pairs)
- MOV [EDI+ECX-4], EAX
- L3:
-
- I have used a different way to calculate the negative of EAX here: inverting
- all bits and adding one. The reason why I am using this method is that I can
- use a dirty trick with the INC instruction: INC doesn't change the carry
- flag, whereas ADD does. I am using ADD rather than INC to increment my loop
- counter and testing the carry flag rather than the zero flag. It is then
- possible to put the INC EAX in between without affecting the carry flag. You
- may think that we could have used LEA EAX,[EAX+1] here instead of
- INC EAX, at least that doesn't change any flags, but the LEA instruction
- would have an AGI stall so that's not the best solution.
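- The negation method used above is simply the two's complement identity:
- inverting all bits and adding one gives the negative. A short C check
- (illustrative only; the function name is made up) confirms it:

```c
#include <assert.h>
#include <stdint.h>

/* -x == ~x + 1 for two's complement integers: this is the identity
   exploited by the XOR EAX,-1 / INC EAX pair in example 5. */
uint32_t neg_via_not_inc(uint32_t x)
{
    return ~x + 1u;
}
```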
-
- I have obtained perfect pairing here and the loop now takes only 3 clock
- cycles. Whether you want to increment the loop counter by 1 (as in example
- 4) or by 4 (as in example 5) is a matter of taste, it makes no difference
- in loop timing.
-
- Overlapping the end of one operation with the beginning of the next
- -------------------------------------------------------------------
- The method used in example 5 is not very generally applicable so we may look
- for other methods of improving pairing opportunities. One way is to
- reorganize the loop so that the end of one operation overlaps with the
- beginning of the next. I will call this convoluting the loop. A convoluted
- loop has an unfinished operation at the end of each loop iteration which
- will be finished in the next run. Actually, example 5 did pair the last MOV
- of one iteration with the first MOV of the next, but we want to explore this
- method further:
-
- Example 6:
-
- MOV ESI, [A]
- MOV EAX, [N]
- MOV EDI, [B]
- XOR ECX, ECX
- LEA ESI, [ESI+4*EAX] ; point to end of array A
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+4*EAX] ; point to end of array B
- JZ SHORT L3
- XOR EBX, EBX
- MOV EAX, [ESI+4*ECX]
- INC ECX
- JZ SHORT L2
- L1: SUB EBX, EAX ; u
- MOV EAX, [ESI+4*ECX] ; v (pairs)
- MOV [EDI+4*ECX-4], EBX ; u
- INC ECX ; v (pairs)
- MOV EBX, 0 ; u
- JNZ L1 ; v (pairs)
- L2: SUB EBX, EAX
- MOV [EDI+4*ECX-4], EBX
- L3:
-
- Here we begin reading the second value before we have stored the first, and
- this of course improves pairing opportunities. The MOV EBX,0 instruction
- has been put in between INC ECX and JNZ L1 not to improve pairing, but to
- avoid an AGI stall.
-
- Rolling out a loop
- ------------------
- The most generally applicable way to improve pairing opportunities is to do
- two operations for each run and do half as many runs. This is called rolling
- out a loop:
-
- Example 7:
-
- MOV ESI, [A]
- MOV EAX, [N]
- MOV EDI, [B]
- XOR ECX, ECX
- LEA ESI, [ESI+4*EAX] ; point to end of array A
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+4*EAX] ; point to end of array B
- JZ SHORT L2
- TEST AL,1 ; test if N is odd
- JZ SHORT L1
- MOV EAX, [ESI+4*ECX] ; N is odd. do the odd one
- NEG EAX
- MOV [EDI+4*ECX], EAX
- INC ECX ; make counter even
- JZ SHORT L2 ; N = 1
- L1: MOV EAX, [ESI+4*ECX] ; u
- MOV EBX, [ESI+4*ECX+4] ; v (pairs)
- NEG EAX ; u
- NEG EBX ; u
- MOV [EDI+4*ECX], EAX ; u
- MOV [EDI+4*ECX+4], EBX ; v (pairs)
- ADD ECX, 2 ; u
- JNZ L1 ; v (pairs)
- L2:
-
- Now we are doing two operations in parallel which gives the best pairing
- opportunities. We have to test if N is odd and if so do one operation
- outside the loop because the loop can only do an even number of operations.
-
- The loop has an AGI stall at the first MOV instruction because ECX has been
- incremented in the preceding clock cycle. The loop therefore takes 6 clock
- cycles for two operations.
-
- Reorganizing a loop to remove AGI stall
- ---------------------------------------
- Example 8:
-
- MOV ESI, [A]
- MOV EAX, [N]
- MOV EDI, [B]
- XOR ECX, ECX
- LEA ESI, [ESI+4*EAX] ; point to end of array A
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+4*EAX] ; point to end of array B
- JZ SHORT L3
- TEST AL,1 ; test if N is odd
- JZ SHORT L2
- MOV EAX, [ESI+4*ECX] ; N is odd. do the odd one
- NEG EAX ; no pairing opportunity
- MOV [EDI+4*ECX], EAX
- INC ECX ; make counter even
- JNZ SHORT L2
- NOP ; add NOP's if JNZ L2 not predictable
- NOP
- JMP SHORT L3 ; N = 1
- L1: NEG EAX ; u
- NEG EBX ; u
- MOV [EDI+4*ECX-8], EAX ; u
- MOV [EDI+4*ECX-4], EBX ; v (pairs)
- L2: MOV EAX, [ESI+4*ECX] ; u
- MOV EBX, [ESI+4*ECX+4] ; v (pairs)
- ADD ECX, 2 ; u
- JNZ L1 ; v (pairs)
- NEG EAX
- NEG EBX
- MOV [EDI+4*ECX-8], EAX
- MOV [EDI+4*ECX-4], EBX
- L3:
-
- The trick is to find a pair of instructions that do not use the loop counter
- as index and reorganize the loop so that the counter is incremented in the
- preceding clock cycle. We are now down at 5 clock cycles for two operations
- which is close to the best possible.
-
- If data caching is critical, then you may improve the speed further by
- combining the A and B arrays into one structured array so that each B[i]
- comes immediately after the corresponding A[i]. If the structured array is
- aligned by at least 8 then B[i] will always be in the same cache line as
- A[i], so you will never have a cache miss when writing B[i]. This may of
- course have a tradeoff in other parts of the program so you have to weigh
- the costs against the benefits.
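- The combined layout can be sketched in C (hypothetical type and function
- names; whether it pays off depends on the rest of the program):

```c
#include <assert.h>

/* A[i] and B[i] interleaved in one structured array. If the array is
   aligned by at least 8, each pair shares a cache line, so writing .b
   never causes an extra cache miss once .a has been read. */
struct pair { int a; int b; };

void change_sign_pairs(struct pair *p, int n)
{
    int i;
    for (i = 0; i < n; i++)
        p[i].b = -p[i].a;
}
```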
-
-
- Rolling out by more than 2
- --------------------------
- You may think of doing more than two operations per iteration in order to
- reduce the loop overhead per operation. But since the loop overhead in most
- cases can be reduced to only one clock cycle per iteration, then rolling out
- the loop by 4 rather than by 2 would only save 1/4 clock cycle per
- operation, which is hardly worth the effort. Only if the loop overhead cannot
- be reduced to one clock cycle and if N is very big, should you think of
- unrolling by 4.
-
- The drawbacks of excessive loop unrolling are:
- 1. You need to calculate N MODULO R, where R is the unrolling factor, and
- do N MODULO R operations before or after the main loop in order to make
- the remaining number of operations divisible by R. This takes a lot of
- extra code and poorly predictable branches. And the loop body of course
- also becomes bigger.
- 2. A piece of code usually takes much more time the first time it executes,
- and the penalty of first time execution is bigger the more code you
- have, especially if N is small.
- 3. Excessive code size makes the utilization of the code cache less
- effective.
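- When the unrolling factor R is a power of two, N MODULO R needs no
- division: it is just the lowest bits of N. In C, the structure of
- example 7 (unrolled by two, with the odd element handled outside the
- loop) looks roughly like this sketch:

```c
#include <assert.h>

/* Negate n integers with a loop unrolled by 2, as in example 7.
   The remainder n MOD 2 is found with a bit test, not a division. */
void change_sign_unrolled(const int *a, int *b, int n)
{
    int i = 0;
    if (n & 1) {             /* n is odd: do the odd one first   */
        b[0] = -a[0];
        i = 1;
    }
    for (; i < n; i += 2) {  /* main loop: two operations per run */
        b[i]     = -a[i];
        b[i + 1] = -a[i + 1];
    }
}
```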
-
- Handling multiple 8 or 16 bit operands simultaneously in 32 bit registers
- -------------------------------------------------------------------------
- If you need to manipulate arrays of 8 or 16 bit operands, then there is a
- problem with unrolled loops because you may not be able to pair two memory
- access operations. For example MOV AL,[ESI] / MOV BL,[ESI+1] will not pair
- if the two operands are within the same dword of memory. But there may be a
- much smarter method, namely to handle four bytes at a time in the same 32
- bit register.
-
- The following example adds 2 to all elements of an array of bytes:
-
- Example 9:
-
- MOV ESI, [A] ; address of byte array
- MOV ECX, [N] ; number of elements in byte array
- TEST ECX, ECX ; test if N is 0
- JZ SHORT L2
- MOV EAX, [ESI] ; read first four bytes
- L1: MOV EBX, EAX ; copy into EBX
- AND EAX, 7F7F7F7FH ; get lower 7 bits of each byte in EAX
- XOR EBX, EAX ; get the highest bit of each byte in EBX
- ADD EAX, 02020202H ; add desired value to all four bytes
- XOR EBX, EAX ; combine bits again
- MOV EAX, [ESI+4] ; read next four bytes
- MOV [ESI], EBX ; store result
- ADD ESI, 4 ; increment pointer
- SUB ECX, 4 ; decrement loop counter
- JA L1 ; loop
- L2:
-
- This loop takes 5 clock cycles for every 4 bytes. The array should of course
- be aligned by 4. If the number of elements in the array is not divisible by
- four, then you may pad it in the end with a few extra bytes to make the
- length divisible by four. This loop will always read past the end of the
- array, so you should make sure the array is not placed at the end of a
- segment to avoid a general protection error.
-
- Note that I have masked out the highest bit of each byte to avoid a possible
- carry from each byte into the next when adding. I am using XOR rather than
- ADD when putting in the high bit again to avoid carry.
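- The masking trick can be expressed compactly in C (a sketch with a
- made-up function name):

```c
#include <assert.h>
#include <stdint.h>

/* Add 2 to each of the four bytes packed in a 32 bit word, as in
   example 9, without carries propagating between the bytes. */
uint32_t add2_bytewise(uint32_t x)
{
    uint32_t low  = x & 0x7F7F7F7Fu;  /* lower 7 bits of each byte */
    uint32_t high = x ^ low;          /* highest bit of each byte  */
    low += 0x02020202u;               /* per-byte sums stay below 256,
                                         so no carry crosses a byte */
    return low ^ high;                /* put the high bits back; XOR
                                         rather than ADD avoids carry */
}
```

- Each byte wraps modulo 256 independently, exactly as four separate byte
- additions would.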
-
- The ADD ESI,4 instruction could have been avoided by using the loop
- counter as index as in example 4. However, this would give an odd number of
- instructions in the loop body, so there would be one unpaired instruction
- and the loop would still take 5 clocks. Making the branch instruction
- unpaired would save one clock after the last operation when the branch is
- mispredicted, but we would have to spend an extra clock cycle in the prolog
- code to setup a pointer to the end of the array and calculate -N, so the two
- methods will be exactly equally fast. The method presented here is the
- simplest and shortest.
-
- The next example finds the length of a zero-terminated string by searching
- for the first byte of zero. It is faster than using REP SCASB:
-
- Example 10:
-
- STRLEN PROC NEAR
- MOV EAX,[ESP+4] ; get pointer
- MOV EDX,7
- ADD EDX,EAX ; pointer+7 used in the end
- PUSH EBX
- MOV EBX,[EAX] ; read first 4 bytes
- ADD EAX,4 ; increment pointer
- L1: LEA ECX,[EBX-01010101H] ; subtract 1 from each byte
- XOR EBX,-1 ; invert all bytes
- AND ECX,EBX ; and these two
- MOV EBX,[EAX] ; read next 4 bytes
- ADD EAX,4 ; increment pointer
- AND ECX,80808080H ; test all sign bits
- JZ L1 ; no zero bytes, continue loop
- TEST ECX,00008080H ; test first two bytes
- JNZ SHORT L2
- SHR ECX,16 ; not in the first 2 bytes
- ADD EAX,2
- L2: SHL CL,1 ; use carry flag to avoid a branch
- POP EBX
- SBB EAX,EDX ; compute length
- RET ; (or RET 4 if pascal)
- STRLEN ENDP
-
- Again we have used the method of overlapping the end of one operation with
- the beginning of the next to improve pairing. I have not unrolled the loop
- because it is likely to repeat relatively few times. The string should of
- course be aligned by 4. The code will always read past the end of the
- string, so the string should not be placed at the end of a segment.
-
- The loop body has an odd number of instructions so there is one unpaired.
- Making the branch instruction unpaired rather than one of the other
- instructions has the advantage that it saves 1 clock cycle when the branch
- is mispredicted.
-
- The TEST ECX,00008080H instruction is non-pairable. You could use the
- pairable instruction OR CH,CL here instead, but then you would have to
- put in a NOP or something to avoid the penalties of consecutive branches.
- Another problem with OR CH,CL is that it would cause a partial register
- stall on a 486 or PentiumPro processor. So I have chosen to keep the
- unpairable TEST instruction.
-
- Handling 4 bytes simultaneously can be quite difficult. The code uses a
- formula which generates a nonzero value for a byte if, and only if, the
- byte is zero. This makes it possible to test all four bytes in one
- operation. This algorithm involves the subtraction of 1 from all bytes (in
- the LEA instruction). I have not masked out the highest bit of each byte
- before subtracting, as I did in the previous example, so the subtraction may
- generate a borrow to the next byte, but only if it is zero, and this is
- exactly the situation where we don't care what the next byte is, because we
- are searching forwards for the first zero. If we were searching backwards
- then we would have to re-read the dword after detecting a zero, and then
- test all four bytes to find the last zero, or use BSWAP to reverse the order
- of the bytes.
-
- If you want to search for a byte value other than zero, then you may XOR all
- four bytes with the value you are searching for, and then use the method
- above to search for zero.
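- The zero-byte formula from the STRLEN loop, and its XOR generalization
- to arbitrary byte values, can be written in C like this (the function
- names are made up):

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if, and only if, at least one byte of x is zero: the
   (x - 01010101H) AND NOT x AND 80808080H formula used in STRLEN. */
uint32_t has_zero_byte(uint32_t x)
{
    return (x - 0x01010101u) & ~x & 0x80808080u;
}

/* Search for an arbitrary byte value: XOR turns a matching byte into
   zero, then the zero-byte test above applies. */
uint32_t has_byte(uint32_t x, uint8_t value)
{
    uint32_t mask = value * 0x01010101u;  /* value repeated 4 times */
    return has_zero_byte(x ^ mask);
}
```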
-
- Handling multiple operands in the same register is easier on the MMX
- processors because they have special instructions and special 64 bit
- registers for this purpose. Using the MMX instructions has a high penalty if
- you are using floating point instructions shortly afterwards, so there may
- still be situations where you want to use 32 bit registers as in the
- examples above.
-
-
- Loops with floating point operations
- ------------------------------------
- The methods of optimizing floating point loops are basically the same as for
- integer loops, although the floating point instructions are overlapping
- rather than pairing.
-
- Consider the C language code:
-
- void DAXPY (double * X, double * Y, double DA, int n) {
-   int i;
-   for (i=0; i<n; i++) Y[i] = Y[i] - DA * X[i];}
-
- This piece of code (called DAXPY) has been studied extensively because it is
- the key to solving linear equations.
-
- Example 11:
-
- DSIZE = 8 ; data size
- MOV EAX, [N] ; number of elements
- MOV ESI, [X] ; pointer to X
- MOV EDI, [Y] ; pointer to Y
- XOR ECX, ECX
- LEA ESI, [ESI+DSIZE*EAX] ; point to end of X
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+DSIZE*EAX] ; point to end of Y
- JZ SHORT L3 ; test for N = 0
- FLD DSIZE PTR [DA]
- FMUL DSIZE PTR [ESI+DSIZE*ECX] ; DA * X[0]
- JMP SHORT L2 ; jump into loop
- L1: FLD DSIZE PTR [DA]
- FMUL DSIZE PTR [ESI+DSIZE*ECX] ; DA * X[i]
- FXCH ; get old result
- FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE] ; store Y[i]
- L2: FSUBR DSIZE PTR [EDI+DSIZE*ECX] ; subtract from Y[i]
- INC ECX ; increment index
- JNZ L1 ; loop
- FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE] ; store last result
- L3:
-
- Here we are using the same methods as in example 6: Using the loop counter
- as index register and counting through negative values up to zero. The end
- of one operation overlaps with the beginning of the next.
-
- The interleaving of floating point operations works perfectly here: The 2
- clock stall between FMUL and FSUBR is filled with the FSTP of the previous
- result. The 3 clock stall between FSUBR and FSTP is filled with the loop
- overhead and the first two instructions of the next operation. An AGI stall
- has been avoided by reading the only parameter that doesn't depend on the
- index in the first clock cycle after the index has been incremented.
-
- This solution takes 6 clock cycles per operation, which is better than the
- unrolled solution published by Intel!
-
- Unrolling floating point loops
- ------------------------------
- The DAXPY loop unrolled by 3 is quite complicated:
-
- Example 12:
- DSIZE = 8 ; data size
- IF DSIZE EQ 4
- SHIFTCOUNT = 2
- ELSE
- SHIFTCOUNT = 3
- ENDIF
-
- MOV EAX, [N] ; number of elements
- MOV ECX, 3*DSIZE ; counter bias
- SHL EAX, SHIFTCOUNT ; DSIZE*N
- JZ L4 ; N = 0
- MOV ESI, [X] ; pointer to X
- SUB ECX, EAX ; (3-N)*DSIZE
- MOV EDI, [Y] ; pointer to Y
- SUB ESI, ECX ; point to end of X, minus bias
- SUB EDI, ECX
- TEST ECX, ECX
- FLD DSIZE PTR [ESI+ECX] ; first X
- JNS SHORT L2 ; less than 4 operations
- L1: ; main loop
- FMUL DSIZE PTR [DA]
- FLD DSIZE PTR [ESI+ECX+DSIZE]
- FMUL DSIZE PTR [DA]
- FXCH
- FSUBR DSIZE PTR [EDI+ECX]
- FXCH
- FLD DSIZE PTR [ESI+ECX+2*DSIZE]
- FMUL DSIZE PTR [DA]
- FXCH
- FSUBR DSIZE PTR [EDI+ECX+DSIZE]
- FXCH ST(2)
- FSTP DSIZE PTR [EDI+ECX]
- FSUBR DSIZE PTR [EDI+ECX+2*DSIZE]
- FXCH
- FSTP DSIZE PTR [EDI+ECX+DSIZE]
- FLD DSIZE PTR [ESI+ECX+3*DSIZE]
- FXCH
- FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
- ADD ECX, 3*DSIZE
- JS L1 ; loop
- L2: FMUL DSIZE PTR [DA] ; finish leftover operation
- FSUBR DSIZE PTR [EDI+ECX]
- SUB ECX, 2*DSIZE ; change pointer bias
- JZ SHORT L3 ; finished
- FLD DSIZE PTR [DA] ; start next operation
- FMUL DSIZE PTR [ESI+ECX+3*DSIZE]
- FXCH
- FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
- FSUBR DSIZE PTR [EDI+ECX+3*DSIZE]
- ADD ECX, 1*DSIZE
- JZ SHORT L3 ; finished
- FLD DSIZE PTR [DA]
- FMUL DSIZE PTR [ESI+ECX+3*DSIZE]
- FXCH
- FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
- FSUBR DSIZE PTR [EDI+ECX+3*DSIZE]
- ADD ECX, 1*DSIZE
- L3: FSTP DSIZE PTR [EDI+ECX+2*DSIZE]
- L4:
-
- The reason why I am showing you how to unroll a loop by 3 is not to
- recommend it, but to warn you how difficult it is! Be prepared to spend a
- considerable amount of time debugging and verifying your code when doing
- something like this. There are several problems to take care of: In most
- cases, you cannot remove all stalls from a floating point loop unrolled by
- less than 4 unless you convolute it (i.e. there are unfinished operations at
- the end of each run which are being finished in the next run). The last FLD
- in the main loop above is the beginning of the first operation in the next
- run. It would be tempting here to make a solution which reads past the end
- of the array and then discards the extra value in the end, as in example 9
- and 10, but that is not recommended in floating point loops because the
- reading of the extra value might generate a denormal operand exception in
- case the memory position after the array doesn't contain a valid floating
- point number. To avoid this, we have to do at least one more operation after
- the main loop.
-
- The number of operations to do outside an unrolled loop would normally be
- N MODULO R, where N is the number of operations, and R is the unrolling
- factor. But in the case of a convoluted loop, we have to do one more, i.e.
- (N-1) MODULO R + 1, for the abovementioned reason.
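- The (N-1) MODULO R + 1 count always lies between 1 and R, so the epilog
- code is guaranteed at least one operation and the remaining count stays
- divisible by R. A trivial C check (illustrative only):

```c
#include <assert.h>

/* Leftover operations for a convoluted loop unrolled by r:
   (n-1) % r + 1 is always in the range 1..r, leaving n minus the
   leftover divisible by r for the main loop. */
int leftover(int n, int r)
{
    return (n - 1) % r + 1;
}
```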
-
- Normally, we would prefer to do the extra operations before the main loop,
- but here we have to do them afterwards for two reasons: One reason is to
- take care of the leftover operand from the convolution. The other reason is
- that calculating the number of extra operations requires a division if R is
- not a power of 2, and a division is time consuming. Doing the extra
- operations after the loop saves the division.
-
- The next problem is to calculate how to bias the loop counter so that it
- will change sign at the right time, and adjust the base pointers so as to
- compensate for this bias. Finally, you have to make sure the leftover
- operand from the convolution is handled correctly for all values of N.
-
- The epilog code doing 1-3 operations could have been implemented as a
- separate loop, but that would cost an extra branch misprediction, so the
- solution above is faster.
-
- Now that I have scared you by demonstrating how difficult it is to unroll by
- 3, I will show you that it is much easier to unroll by 4:
-
- Example 13:
-
- DSIZE = 8 ; data size
- MOV EAX, [N] ; number of elements
- MOV ESI, [X] ; pointer to X
- MOV EDI, [Y] ; pointer to Y
- XOR ECX, ECX
- LEA ESI, [ESI+DSIZE*EAX] ; point to end of X
- SUB ECX, EAX ; -N
- LEA EDI, [EDI+DSIZE*EAX] ; point to end of Y
- TEST AL,1 ; test if N is odd
- JZ SHORT L1
- FLD DSIZE PTR [DA] ; do the odd operation
- FMUL DSIZE PTR [ESI+DSIZE*ECX]
- FSUBR DSIZE PTR [EDI+DSIZE*ECX]
- INC ECX ; adjust counter
- FSTP DSIZE PTR [EDI+DSIZE*ECX-DSIZE]
- L1: TEST AL,2 ; test for possibly 2 more operations
- JZ L2
- FLD DSIZE PTR [DA] ; N MOD 4 = 2 or 3. Do two more
- FMUL DSIZE PTR [ESI+DSIZE*ECX]
- FLD DSIZE PTR [DA]
- FMUL DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
- FXCH
- FSUBR DSIZE PTR [EDI+DSIZE*ECX]
- FXCH
- FSUBR DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
- FXCH
- FSTP DSIZE PTR [EDI+DSIZE*ECX]
- FSTP DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
- ADD ECX, 2 ; counter is now divisible by 4
- L2: TEST ECX, ECX
- JZ L4 ; no more operations
- L3: ; main loop:
- FLD DSIZE PTR [DA]
- FLD DSIZE PTR [ESI+DSIZE*ECX]
- FMUL ST,ST(1)
- FLD DSIZE PTR [ESI+DSIZE*ECX+DSIZE]
- FMUL ST,ST(2)
- FLD DSIZE PTR [ESI+DSIZE*ECX+2*DSIZE]
- FMUL ST,ST(3)
- FXCH ST(2)
- FSUBR DSIZE PTR [EDI+DSIZE*ECX]
- FXCH ST(3)
- FMUL DSIZE PTR [ESI+DSIZE*ECX+3*DSIZE]
- FXCH
- FSUBR DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
- FXCH ST(2)
- FSUBR DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
- FXCH
- FSUBR DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
- FXCH ST(3)
- FSTP DSIZE PTR [EDI+DSIZE*ECX]
- FSTP DSIZE PTR [EDI+DSIZE*ECX+2*DSIZE]
- FSTP DSIZE PTR [EDI+DSIZE*ECX+DSIZE]
- FSTP DSIZE PTR [EDI+DSIZE*ECX+3*DSIZE]
- ADD ECX, 4 ; increment index by 4
- JNZ L3 ; loop
- L4:
-
- It is usually quite easy to find a stall-free solution when unrolling by 4,
- and there is no need for convolution. The number of extra operations to do
- outside the main loop is N MODULO 4, which can be calculated easily without
- division, simply by testing the two lowest bits in N. The extra operations
- are done before the main loop rather than after, to make the handling of the
- loop counter simpler.
-
- The tradeoff of loop unrolling is that the extra operations outside the loop
- are slower due to incomplete overlapping and possible branch mispredictions,
- and the first time penalty is higher because of increased code size.
-
- As a general recommendation, I would say that if N is big or if convoluting
- the loop without unrolling cannot remove enough stalls, then you should
- unroll critical integer loops by 2 and floating point loops by 4.
-
-
- 17. DISCUSSION OF SPECIAL INSTRUCTIONS
- ======================================
-
- 17.1 LEA
- --------
- The LEA instruction is useful for many purposes because it can do a shift,
- two additions, and a move in just one pairable instruction taking one clock
- cycle. Example:
- LEA EAX,[EBX+8*ECX-1000]
- is much faster than
- MOV EAX,ECX / SHL EAX,3 / ADD EAX,EBX / SUB EAX,1000
- The LEA instruction can also be used to do an add or shift without changing
- the flags. The source and destination need not have the same word size, so
- LEA EAX,[BX] is a useful replacement for MOVZX EAX,BX.
-
- You must be aware, however, that the LEA instruction will suffer an AGI
- stall if it uses a base or index register which has been changed in the
- preceding clock cycle.
-
- Since the LEA instruction is pairable in the V-pipe and shift instructions
- are not, you may use LEA as a substitute for a SHL by 1, 2, or 3 if you want
- the instruction to execute in the V-pipe.
-
- The 32 bit processors have no documented addressing mode with a scaled index
- register and nothing else, so an instruction like LEA EAX,[EAX*2] is
- actually coded as LEA EAX,[EAX*2+00000000] with an immediate displacement
- of 4 bytes. You may reduce the instruction size by instead writing LEA
- EAX,[EAX+EAX] or even better ADD EAX,EAX. The latter code cannot have an
- AGI delay. If you happen to have a register which is zero (like a loop
- counter after a loop), then you may use it as a base register to reduce the
- code size:
-
- LEA EAX,[EBX*4] ; 7 bytes
- LEA EAX,[ECX+EBX*4] ; 3 bytes
-
-
- 17.2 MOV [MEM],ACCUM
- --------------------
- The instructions MOV [mem],AL MOV [mem],AX MOV [mem],EAX
- are treated by the pairing circuitry as if they were writing to the
- accumulator. Thus the following instructions do not pair:
- MOV [mydata], EAX
- MOV EBX, EAX
-
- This problem occurs only with the short form of the MOV instruction which
- can not have a base or index register, and which can only have the
- accumulator as source. You can avoid the problem by using another register,
- by reordering your instructions, by using a pointer, or by hard-coding the
- general form of the MOV instruction.
-
- In 32 bit mode you can write the general form of MOV [mem],EAX:
- DB 89H, 05H
- DD OFFSET DS:[mem]
-
- In 16 bit mode you can write the general form of MOV [mem],AX:
- DB 89H, 06H
- DW OFFSET DS:[mem]
-
- To use AL instead of (e)AX, you replace 89H with 88H.
-
-
- 17.3 TEST
- ---------
- The TEST instruction with an immediate operand is only pairable if the
- destination is AL, AX, or EAX.
-
- TEST register,register and TEST register,memory is always pairable.
-
- Examples:
- TEST ECX,ECX ; pairable
- TEST [mem],EBX ; pairable
- TEST EDX,256 ; not pairable
- TEST DWORD PTR [EBX],8000H ; not pairable
- To make it pairable, use any of the following methods:
- MOV EAX,[EBX] / TEST EAX,8000H
- MOV EDX,[EBX] / AND EDX,8000H
- MOV AL,[EBX+1] / TEST AL,80H
- MOV AL,[EBX+1] / TEST AL,AL ; (result in sign flag)
- It is also possible to test a bit by shifting it into the carry flag:
- MOV EAX,[EBX] / SHR EAX,16 ; (result in carry flag)
- but this method has a penalty on the PentiumPro when the shift count is more
- than one.
-
- (The reason for this non-pairability is probably that the first byte of the
- 2-byte instruction is the same as for some other non-pairable instructions,
- and the Pentium cannot afford to check the second byte too when determining
- pairability.)
-
-
- 17.4 XCHG
- ---------
- The XCHG register,memory instruction is dangerous. This instruction
- always has an implicit LOCK prefix which prevents it from using the
- cache. It is therefore very time consuming and should always be
- avoided.
-
-
- 17.5 rotates through carry
- --------------------------
- RCR and RCL with a count different from one are slow and should be avoided.
-
-
- 17.6 string instructions
- ------------------------
- String instructions without a repeat prefix are too slow, and should always
- be replaced by simpler instructions. The same applies to LOOP and JECXZ.
-
- String instructions with repeat may be optimal. Always use the dword version
- if possible, and make sure that both source and destination are aligned by 4.
-
- REP MOVSD is the fastest way to move blocks of data when the destination is
- in the level 1 cache. See section 19 for an alternative.
- Note that while the REP MOVSD instruction writes a DWORD to the destination,
- it reads the next DWORD from the source in the same clock cycle. According
- to rule 10.3 above, you have a cache bank conflict if bits 2-4 are the same
- in these two addresses. In other words, you will get a penalty of one clock
- extra per iteration if ESI+4-EDI is divisible by 32. This can be avoided if
- both source and destination are aligned by 8.
-
- REP STOSD is optimal when the destination is in the cache.
-
- REP LODS, REP SCAS, and REP CMPS are not optimal, and may be replaced by
- loops. See section 16 example 10 for an alternative to REP SCASB.
-
-
- 17.7 bit scan
- -------------
- BSF and BSR are among the most poorly optimized instructions on the
- Pentium, taking approximately 11 + 2*n clock cycles, where n is the number
- of zeros skipped. (On later processors they take only 1 or 2 clock cycles.)
-
- The following code emulates BSR ECX,EAX:
- TEST EAX,EAX
- JZ SHORT BS1
- MOV DWORD PTR [TEMP],EAX
- MOV DWORD PTR [TEMP+4],0
- FILD QWORD PTR [TEMP]
- FSTP QWORD PTR [TEMP]
- WAIT ; WAIT only needed for compatibility with earlier processors
- MOV ECX, DWORD PTR [TEMP+4]
- SHR ECX,20
- SUB ECX,3FFH
- TEST EAX,EAX ; clear zero flag
- BS1:
-
- The following code emulates BSF ECX,EAX:
- TEST EAX,EAX
- JZ SHORT BS2
- XOR ECX,ECX
- MOV DWORD PTR [TEMP+4],ECX
- SUB ECX,EAX
- AND EAX,ECX
- MOV DWORD PTR [TEMP],EAX
- FILD QWORD PTR [TEMP]
- FSTP QWORD PTR [TEMP]
- WAIT ; WAIT only needed for compatibility with earlier processors
- MOV ECX, DWORD PTR [TEMP+4]
- SHR ECX,20
- SUB ECX,3FFH
- TEST EAX,EAX ; clear zero flag
- BS2:
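-
- The same exponent-extraction trick can be sketched in portable C (an
- illustration only; the helper names emu_bsr and emu_bsf are my own, and
- IEEE 754 doubles are assumed):
-
```c
#include <stdint.h>
#include <string.h>

/* Emulate BSR: index of the highest set bit of a nonzero x.
   Converting x to double puts that index, biased by 3FFH (1023),
   in the exponent field (bits 52-62), just as the FILD/FSTP
   sequence above leaves it in [TEMP+4]. */
int emu_bsr(uint32_t x)
{
    double d = (double)x;
    uint64_t bits;
    memcpy(&bits, &d, 8);                      /* reinterpret the double */
    return (int)((bits >> 52) & 0x7FF) - 1023; /* shift by 52, SUB 3FFH */
}

/* Emulate BSF: isolate the lowest set bit with x & -x first
   (the XOR/SUB/AND steps above), then scan as for BSR. */
int emu_bsf(uint32_t x)
{
    return emu_bsr(x & (0u - x));
}
```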
-
-
- 17.8 bit test
- -------------
- BT, BTC, BTR, and BTS instructions should preferably be replaced by
- instructions like TEST, AND, OR, XOR, or shifts.
-
-
- 17.9 integer multiplication
- ---------------------------
- An integer multiplication takes approximately 9 clock cycles. It is
- therefore advantageous to replace a multiplication by a constant with a
- combination of other instructions such as SHL, ADD, SUB, and LEA. Example:
- IMUL EAX,10
- can be replaced with
- MOV EBX,EAX / ADD EAX,EAX / SHL EBX,3 / ADD EAX,EBX
- or
- LEA EAX,[EAX+4*EAX] / ADD EAX,EAX
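-
- The decomposition is easy to check in C (a sketch; mul10 and mul10_lea
- are hypothetical names):
-
```c
#include <stdint.h>

/* x*10 = x*2 + x*8: the MOV/ADD/SHL/ADD sequence above computes
   EAX = 2*x and EBX = 8*x, then adds them. */
uint32_t mul10(uint32_t x)
{
    return (x << 1) + (x << 3);
}

/* The LEA variant: EAX = x + 4*x = 5*x, then doubled. */
uint32_t mul10_lea(uint32_t x)
{
    return (x + 4 * x) * 2;
}
```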
-
- Floating point multiplication is faster than integer multiplication on a
- Pentium without MMX, but the time used to convert integers to float and
- convert the product back again is usually more than the time saved by using
- floating point multiplication, except when the number of conversions is low
- compared with the number of multiplications.
-
-
- 17.10 division
- --------------
- Division is quite time consuming. The DIV instruction takes 17, 25, or 41
- clock cycles for byte, word, and dword divisors respectively. The IDIV
- instruction takes 5 clock cycles more. It is therefore preferable to use the
- smallest operand size possible that won't generate an overflow, even if it
- costs an operand size prefix, and use unsigned division if possible.
-
- Unsigned division by a power of two can be done with SHR. Division of a
- signed number by a power of two can be done with SAR, but the result with
- SAR is rounded towards minus infinity, whereas the result with IDIV is
- truncated towards zero.
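-
- The difference can be demonstrated in C (a sketch; C's / operator
- truncates towards zero like IDIV, and >> on a negative int is assumed to
- be an arithmetic shift like SAR, which x86 compilers provide although the
- C standard leaves it implementation-defined):
-
```c
/* Division by 4 three ways.  SAR rounds towards minus infinity,
   IDIV truncates towards zero; adding 2^n - 1 to a negative dividend
   before shifting makes the shift match IDIV. */
int div4_sar(int x)  { return x >> 2; }    /* rounds towards -infinity */
int div4_idiv(int x) { return x / 4; }     /* truncates towards zero   */

int div4_fixed(int x)
{
    if (x < 0)
        x += 3;                            /* 2^2 - 1 */
    return x >> 2;
}
```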
-
- Dividing a positive number N (< 2^16) by a constant D which is not a
- power of two can be done by multiplying N by 2^16 / D and then dividing
- by 2^16. For example, to divide a number by 5, you may multiply by
- 2^16 / 5 = 3333H, and then divide by 2^16:
-
- INC EAX / IMUL EAX,3333H / SHR EAX,16
-
- The INC EAX is added to compensate for the double rounding error.
- This method works for 0 <= N <= 2^16. You have to test the code thoroughly
- to make sure the rounding is correct.
-
- If you want to replace the IMUL with faster pairable instructions, then
- you may write:
-
- LEA EBX,[EAX+2*EAX+3]
- LEA ECX,[EAX+2*EAX+3]
- SHL EBX,4
- MOV EAX,ECX
- SHL ECX,8
- ADD EAX,EBX
- SHL EBX,8
- ADD EAX,ECX
- ADD EAX,EBX
- SHR EAX,16
-
- You can use the same code to divide by 10. Just shift right by 17
- instead of 16 in the end.
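-
- Both variants can be checked in C (a sketch; div5 and div10 are
- hypothetical names):
-
```c
#include <stdint.h>

/* Divide by 5: multiply by 3333H (an approximation of 2^16 / 5)
   and shift right by 16.  The +1 is the INC EAX compensating for
   the double rounding error. */
uint32_t div5(uint32_t n)            /* valid for 0 <= n <= 2^16 */
{
    return ((n + 1) * 0x3333u) >> 16;
}

/* Divide by 10: same multiplier (3333H ~ 2^17 / 10), shift by 17. */
uint32_t div10(uint32_t n)           /* valid for 0 <= n <= 2^16 */
{
    return ((n + 1) * 0x3333u) >> 17;
}
```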
-
-
- Floating point division takes 39 clock cycles for the highest precision.
- You can save time by specifying a lower precision in the floating point
- control word (only FDIV and FIDIV are faster at low precision, no other
- instructions can be sped up this way).
-
- It is possible to do a floating point division and an integer division in
- parallel to save time. Example: A = A1 / A2; B = B1 / B2
-
- FILD [B1]
- FILD [B2]
- MOV EAX,[A1]
- MOV EBX,[A2]
- CDQ
- FDIV
- DIV EBX
- FISTP [B]
- MOV [A],EAX
- (make sure you set the floating point control word to the desired
- rounding method)
-
- Obviously, you should always try to minimize the number of divisions.
- Floating point division with a constant or repeated division with the same
- value should of course be done by multiplying with the reciprocal.
- But there are many other situations where you can reduce the number of
- divisions. For example:
- if (A/B > C)... can be rewritten as if (A > B*C)... when B is
- positive, and the opposite when B is negative.
-
- A/B + C/D can be rewritten as (A*D + C*B) / (B*D)
-
- If you are using integer division, then you should be aware that the
- rounding errors may be different when you rewrite the formulas.
-
-
- 17.11 WAIT
- ----------
- You can often increase speed by omitting the WAIT instruction.
- The WAIT instruction has three functions:
-
- a. The 8087 processor requires a WAIT before _every_ floating point
- instruction to make sure the coprocessor is ready to receive it.
-
- b. WAIT is used to coordinate memory access between the floating point
- unit and the integer unit. Examples:
- b.1. FISTP [mem32]
- WAIT ; wait for floating point unit to write before..
- MOV EAX,[mem32] ; reading the result with the integer unit
-
- b.2. FILD [mem32]
- WAIT ; wait for floating point unit to read value..
- MOV [mem32],EAX ; before overwriting it with integer unit
-
- b.3. FLD QWORD PTR [ESP]
- WAIT ; prevent an accidental hardware interrupt from..
- ADD ESP,8 ; overwriting value on stack before it is read
-
- c. WAIT is sometimes used to check for exceptions. It will generate an
- interrupt if an unmasked exception bit in the floating point status
- word has been set by a preceding floating point instruction.
-
- Regarding a:
- The function in point a is never needed on any processor other than the
- old 8087. Unless you want your code to be compatible with the 8087 you
- should tell your assembler not to put in these WAITs by specifying a
- higher processor. An 8087 floating point emulator also inserts WAIT
- instructions.
- You should therefore tell your assembler not to generate emulation code
- unless you need it.
-
- Regarding b:
- WAIT instructions to coordinate memory access are definitely needed on the
- 8087 and 80287 but not on the Pentium. It is not quite clear whether it is
- needed on the 80387 and 80486. I have made several tests on these Intel
- processors and have not been able to provoke any error by omitting the
- WAIT on any 32 bit Intel processor, although Intel manuals say the WAIT is
- needed for this purpose except after FNSTSW and FNSTCW. Omitting WAIT
- instructions for coordinating memory access is not 100% safe, even when
- writing 32 bit code, because the code may be able to run on an 80386 main
- processor with a 287 coprocessor, which requires the WAIT. Also, I have not
- tested all possible hardware and software combinations, so there may be
- other situations where the WAIT is needed.
-
- If you want to be certain that your code will work on _any_ 32 bit processor
- (including non-Intel processors) then I would recommend that you include the
- WAIT here in order to be safe.
-
- Regarding c:
- The assembler automatically inserts a WAIT for this purpose before the
- following instructions: FCLEX, FINIT, FSAVE, FSTCW, FSTENV, FSTSW.
- You can omit the WAIT by writing FNCLEX, etc. My tests show that the WAIT
- is unnecessary in most cases because these instructions without WAIT will
- still generate an interrupt on exceptions except for FNCLEX and FNINIT on
- the 80387. (There is some inconsistency about whether the IRET from the
- interrupt points to the FN.. instruction or to the next instruction).
-
- Almost all other floating point instructions will also generate an interrupt
- if a previous floating point instruction has set an unmasked exception bit,
- so the exception is likely to be detected sooner or later anyway. You may
- insert a WAIT after the last floating point instruction in your program to
- be sure to catch all exceptions.
-
- You may still need the WAIT if you want to know exactly where an exception
- occurred in order to be able to recover from the situation. Consider, for
- example, the code under b.3 above: If you want to be able to recover from
- an exception generated by the FLD here, then you need the WAIT because an
- interrupt after ADD ESP,8 would overwrite the value to load.
-
-
- 17.12 FCOM + FSTSW AX
- ---------------------
- The usual way of doing floating point comparisons is:
- FLD [a]
- FCOMP [b]
- FSTSW AX
- SAHF
- JB ASmallerThanB
-
- You may improve this code by using FNSTSW AX rather than FSTSW AX and
- test AH directly rather than using the non-pairable SAHF.
- (TASM version 3.0 has a bug with the FNSTSW AX instruction)
-
- FLD [a]
- FCOMP [b]
- FNSTSW AX
- SHR AH,1
- JC ASmallerThanB
-
- Testing for zero or equality:
-
- FTST
- FNSTSW AX
- AND AH,40H
- JNZ IsZero ; (the zero flag is inverted!)
-
- Test if greater:
-
- FLD [a]
- FCOMP [b]
- FNSTSW AX
- AND AH,41H
- JZ AGreaterThanB
-
- Do not use TEST AH,41H as it is not pairable. Do not use TEST EAX,4100H as
- it would produce a partial register stall on the PentiumPro. Do not test the
- flags after multibit shifts, as this has a penalty on the PentiumPro.
-
- The FNSTSW instruction delays for 4 clocks after any floating point
- instruction (probably because it is waiting for the status word to retire
- from the pipeline). If you can fill this latency with integer instructions
- between FCOM and FNSTSW, then the FNSTSW instruction will take only 2
- clocks, rather than 6. Make sure you use the version without WAIT prefix in
- this case.
-
- It is sometimes faster to use integer instructions for comparing floating
- point values, as described in section 18 below.
-
-
- 17.13 freeing floating point registers
- --------------------------------------
- You have to free all used floating point registers before exiting a
- subroutine, except for any register used for the result.
-
- The fastest way to free one register is FSTP ST.
- The fastest way to free two registers is FCOMPP.
- It is not recommended to use FFREE.
-
-
- 17.14 FISTP
- -----------
- Converting a floating point number to integer is normally done like this:
- FISTP DWORD PTR [TEMP]
- MOV EAX, [TEMP]
-
- An alternative method is:
-
- .DATA
- ALIGN 8
- TEMP DQ ?
- MAGIC DD 59C00000H ; f.p. representation of 2^51 + 2^52
-
- .CODE
- FADD [MAGIC]
- FSTP QWORD PTR [TEMP]
- MOV EAX, DWORD PTR [TEMP]
-
- Adding the 'magic number' of 2^51 + 2^52 has the effect that any integer
- between -2^31 and +2^31 will be aligned in the lower 32 bits when storing
- as a double precision floating point number. The result is the same as you
- get with FISTP for all rounding methods except truncation towards zero.
- The result is different from FISTP if the control word specifies truncation
- or in case of overflow. You may need a WAIT instruction for compatibility
- with other processors. See section 17.11.
-
- This method is not faster than using FISTP, but it gives better scheduling
- opportunities because there is a 3 clock void between FADD and FSTP which
- may be filled with other instructions. You may multiply or divide the number
- by a power of 2 in the same operation by doing the opposite to the magic
- number. You may also add a constant by adding it to the magic number,
- which then has to be double precision.
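-
- The magic number method can be sketched in C (an illustration under the
- stated assumptions: round-to-nearest mode, IEEE 754 doubles, and a
- little-endian memory layout so the low dword comes first; fast_round is
- a hypothetical name):
-
```c
#include <stdint.h>
#include <string.h>

/* Adding 2^51 + 2^52 aligns the integer part of x in the lower 32
   bits of the double's bit pattern; reading back that dword gives
   the rounded integer, matching FISTP under round-to-nearest. */
int32_t fast_round(double x)             /* requires |x| < 2^31 */
{
    double d = x + 6755399441055744.0;   /* 2^51 + 2^52 */
    int32_t lo;
    memcpy(&lo, &d, 4);                  /* low dword, little-endian */
    return lo;
}
```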
-
-
- 17.15 FPTAN
- -----------
- According to the manuals, FPTAN returns two values X and Y and leaves it to
- the programmer to divide Y by X to get the result, but in fact it always
- returns 1 in X so you can save the division. My tests show that on all 32
- bit Intel processors with floating point unit or coprocessor, FPTAN always
- returns 1 in X regardless of the argument. If you want to be absolutely sure
- that your code will run correctly on all processors, then you may test if X
- is 1, which is faster than dividing by X. The Y value may be very high,
- but never infinity, so you don't have to test if Y contains a valid number
- if you know that the argument is valid.
-
-
-
- 18. USING INTEGER INSTRUCTIONS TO DO FLOATING POINT OPERATIONS
- ==============================================================
- Integer instructions are generally faster than floating point instructions,
- so it is often advantageous to use integer instructions for doing simple
- floating point operations. The most obvious example is moving data. Example:
-
- FLD QWORD PTR [ESI] / FSTP QWORD PTR [EDI]
-
- Change to:
-
- MOV EAX,[ESI] / MOV EBX,[ESI+4] / MOV [EDI],EAX / MOV [EDI+4],EBX
-
- The former code takes 4 clocks, the latter takes 2.
-
- Testing if a floating point value is zero:
-
- The floating point value of zero is usually represented as 32 or 64 bits of
- zero, but there is a pitfall here: The sign bit may be set! Minus zero is
- regarded as a valid floating point number, and the processor may actually
- generate a zero with the sign bit set if for example multiplying a negative
- number with zero. So if you want to test if a floating point number is zero,
- you should not test the sign bit. Example:
-
- FLD DWORD PTR [EBX] / FTST / FNSTSW AX / AND AH,40H / JNZ IsZero
-
- Use integer instructions instead, and shift out the sign bit:
-
- MOV EAX,[EBX] / ADD EAX,EAX / JZ IsZero
-
- The former code takes 9 clocks, the latter takes only 2.
- If the floating point number is double precision (QWORD) then you only have
- to test bits 32-62. If they are zero, then the lower half will also be zero
- if it is a normal floating point number.
-
- Testing if negative:
- A floating point number is negative if the sign bit is set and at least one
- other bit is set. Example:
- MOV EAX,[NumberToTest] / CMP EAX,80000000H / JA IsNegative
-
- Manipulating the sign bit:
- You can change the sign of a floating point number simply by flipping the
- sign bit. Example:
- XOR BYTE PTR [a] + (TYPE a) - 1, 80H
-
- Likewise you may get the absolute value of a floating point number by simply
- ANDing out the sign bit.
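-
- These bit tests translate directly to C (a sketch for single precision;
- the helper names are my own, and memcpy stands in for the plain integer
- load that the assembly uses):
-
```c
#include <stdint.h>
#include <string.h>

/* Read the bit pattern of a float (assumes IEEE 754). */
uint32_t float_bits(float f)
{
    uint32_t b;
    memcpy(&b, &f, 4);
    return b;
}

/* Zero test: shift out the sign bit, as ADD EAX,EAX does above,
   so both +0.0 and -0.0 count as zero. */
int float_is_zero(float f) { return (float_bits(f) << 1) == 0; }

/* Negative test: sign bit set and at least one other bit set,
   i.e. the pattern is above 80000000H as an unsigned number. */
int float_is_neg(float f)  { return float_bits(f) > 0x80000000u; }

/* Change sign by flipping the sign bit (absolute value would
   instead clear it with AND). */
float float_chs(float f)
{
    uint32_t b = float_bits(f) ^ 0x80000000u;
    float r;
    memcpy(&r, &b, 4);
    return r;
}
```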
-
- Comparing numbers:
- Floating point numbers are stored in a unique format which allows you to use
- integer instructions for comparing floating point numbers, except for the
- sign bit. If you are certain that two floating point numbers both are normal
- and positive then you may simply compare them as integers. Example:
-
- FLD [a] / FCOMP [b] / FNSTSW AX / AND AH,1 / JNZ ASmallerThanB
-
- Change to:
-
- MOV EAX,[a] / MOV EBX,[b] / CMP EAX,EBX / JB ASmallerThanB
-
- This method only works if the two numbers have the same precision and you
- are certain that neither of the numbers has the sign bit set.
-
- If negative numbers are possible, then you have to convert the
- negative numbers to two's complement, and do a signed compare:
-
- MOV EAX,[a]
- MOV EBX,[b]
- MOV ECX,EAX
- MOV EDX,EBX
- SAR ECX,31 ; copy sign bit
- AND EAX,7FFFFFFFH ; remove sign bit
- SAR EDX,31
- AND EBX,7FFFFFFFH
- XOR EAX,ECX          ; make two's complement if sign bit was set
- XOR EBX,EDX
- SUB EAX,ECX
- SUB EBX,EDX
- CMP EAX,EBX
- JL ASmallerThanB ; signed comparison
-
- This method works for all normal floating point numbers, including -0.
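-
- The transform can be written out in C (a sketch; float_key is a
- hypothetical name for the value the SAR/AND/XOR/SUB sequence above leaves
- in EAX and EBX, and >> on a negative int is assumed to be an arithmetic
- shift, as on x86 compilers):
-
```c
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to a signed integer with the same
   ordering: copy the sign (SAR), strip the sign bit (AND), then
   conditionally negate (XOR/SUB = two's complement when negative). */
int32_t float_key(float f)
{
    int32_t b;
    memcpy(&b, &f, 4);               /* assumes IEEE 754 */
    int32_t sign = b >> 31;          /* 0 for positive, -1 for negative */
    b &= 0x7FFFFFFF;                 /* remove sign bit */
    return (b ^ sign) - sign;        /* negate if sign was set */
}

/* a < b as floats exactly when the keys compare as signed integers;
   works for all normal numbers, including -0. */
int float_less(float a, float b) { return float_key(a) < float_key(b); }
```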
-
-
- 19. USING FLOATING POINT INSTRUCTIONS TO DO INTEGER OPERATIONS
- ==============================================================
- 19.1 Moving data
- ----------------
- Floating point instructions can be used to move 8 bytes at a time:
- FILD QWORD PTR [ESI] / FISTP QWORD PTR [EDI]
- This is only an advantage if the destination is not in the cache. The
- optimal way to move a block of data to uncached memory on the Pentium is:
-
- TopOfLoop:
- FILD QWORD PTR [ESI]
- FILD QWORD PTR [ESI+8]
- FXCH
- FISTP QWORD PTR [EDI]
- FISTP QWORD PTR [EDI+8]
- ADD ESI,16
- ADD EDI,16
- DEC ECX
- JNZ TopOfLoop
-
- The source and destination should of course be aligned by 8. The extra time
- used by the slow FILD and FISTP instructions is compensated for by the fact
- that you only have to do half as many write operations. Note that this
- method is only advantageous on the Pentium and only if the destination is
- not in the level 1 cache. On all other processors the optimal way to move
- blocks of data is REP MOVSD, or if you have a processor with MMX you may use
- the MMX instructions instead to write 8 bytes at a time.
-
- 19.2 Integer multiplication
- ---------------------------
- Floating point multiplication is faster than integer multiplication on the
- Pentium without MMX, but the price for converting integer factors to float
- and converting the result back to integer is high, so floating point
- multiplication is only advantageous if the number of conversions needed is
- low compared with the number of multiplications. Integer multiplication is
- faster than floating point on other processors.
-
- It may be tempting to use denormal floating point operands to save some of
- the conversions here, but the handling of denormals is very slow, so this
- is not a good idea!
-
- 19.3 Integer division
- ---------------------
- Floating point division is not faster than integer division, but you can do
- other integer operations (including integer division, but not integer
- multiplication) while the floating point unit is working on the division.
- See paragraph 17.10 above for an example.
-
- 19.4 Converting binary to decimal numbers
- -----------------------------------------
- Using the FBSTP instruction is a simple and convenient way of converting a
- binary number to decimal, although not necessarily the fastest method.
-
-
- 20. LIST OF INTEGER INSTRUCTIONS
- ================================
- Explanations:
- Operands: r=register, m=memory, i=immediate data, sr=segment register
- m32= 32 bit memory operand, etc.
-
- Clock cycles:
- The numbers are minimum values. Cache misses, misalignment, and exceptions
- may increase the clock counts considerably.
-
- Pairability:
- u=pairable in U-pipe, v=pairable in V-pipe, uv=pairable in either pipe,
- np=not pairable
-
- Opcode Operands Clock cycles Pairability
- ----------------------------------------------------------------------------
- NOP 1 uv
- MOV r/m, r/m/i 1 uv
- MOV r/m, sr 1 np
- MOV sr , r/m >= 2 b) np
- MOV m , accum 1 uv h)
- XCHG (E)AX, r 2 np
- XCHG r , r 3 np
- XCHG r , m >20 np
- XLAT 4 np
- PUSH r/i 1 uv
- POP r 1 uv
- PUSH m 2 np
- POP m 3 np
- PUSH sr 1 b) np
- POP sr >= 3 b) np
- PUSHF 4 np
- POPF 6 np
- PUSHA POPA 5 np
- LAHF SAHF 2 np
- MOVSX MOVZX r, r/m 3 a) np
- LEA r/m 1 uv
- LDS LES LFS LGS LSS m 4 c) np
- ADD SUB AND OR XOR r , r/i 1 uv
- ADD SUB AND OR XOR r , m 2 uv
- ADD SUB AND OR XOR m , r/i 3 uv
- ADC SBB r , r/i 1 u
- ADC SBB r , m 2 u
- ADC SBB m , r/i 3 u
- CMP r , r/i 1 uv
- CMP m , r/i 2 uv
- TEST r , r 1 uv
- TEST m , r 2 uv
- TEST r , i 1 f)
- TEST m , i 2 np
- INC DEC r 1 uv
- INC DEC m 3 uv
- NEG NOT r/m 1/3 np
- MUL IMUL r8/r16/m8/m16 11 np
- MUL IMUL all other versions 9 d) np
- DIV r8/r16/r32 17/25/41 np
- IDIV r8/r16/r32 22/30/46 np
- CBW CWDE 3 np
- CWD CDQ 2 np
- SHR SHL SAR SAL r , i 1 u
- SHR SHL SAR SAL m , i 3 u
- SHR SHL SAR SAL r/m, CL 4/5 np
- ROR ROL RCR RCL r/m, 1 1/3 u
- ROR ROL r/m, i(><1) 1/3 np
- ROR ROL r/m, CL 4/5 np
- RCR RCL r/m, i(><1) 8/10 np
- RCR RCL r/m, CL 7/9 np
- SHLD SHRD r, i/CL 4 a) np
- SHLD SHRD m, i/CL 5 a) np
- BT r, r/i 4 a) np
- BT m, i 4 a) np
- BT m, r 9 a) np
- BTR BTS BTC r, r/i 7 a) np
- BTR BTS BTC m, i 8 a) np
- BTR BTS BTC m, r 14 a) np
- BSF BSR r , r/m 7-73 a) np
- SETcc r/m 1/2 a) np
- JMP CALL short/near 1/4 e) v
- JMP CALL far >= 3 e) np
- conditional jump short/near 1/4/5 e) v
- CALL JMP r/m 2/5 e) np
- RETN 2/5 e) np
- RETN i 3/6 e) np
- RETF 4/7 e) np
- RETF i 5/8 e) np
- J(E)CXZ short 5-8 e) np
- LOOP short 5-9 e) np
- BOUND r , m 8 np
- CLC STC CMC CLD STD 2 np
- CLI STI 6-7 np
- LODS 2 np
- REP LODS 7+3*n g) np
- STOS 3 np
- REP STOS 10+n g) np
- MOVS 4 np
- REP MOVSB 12+1.8*n g) np
- REP MOVSW 12+1.5*n g) np
- REP MOVSD 12+n g) np
- SCAS 4 np
- REP(N)E SCAS 9+4*n g) np
- CMPS 5 np
- REP(N)E CMPS 8+5*n g) np
- BSWAP 1 a) np
- CPUID 13/15/16 a) np
- RDTSC 6 a) np
- ----------------------------------------------------------------------------
- Notes:
- a) this instruction has a 0FH prefix which takes one clock cycle extra to
- decode on a Pentium without MMX unless preceded by a multicycle
- instruction (see section 13 above).
- b) versions with FS and GS have a 0FH prefix. see note a.
- c) versions with SS, FS, and GS have a 0FH prefix. see note a.
- d) versions with two operands and no immediate have a 0FH prefix, see note a.
- e) see section 12 above
- f) only pairable if register is accumulator. see paragraph 17.3 above
- g) add one clock cycle for decoding the repeat prefix unless preceded by a
- multicycle instruction (such as CLD. see section 13 above).
- h) pairs as if it were writing to the accumulator. see paragraph 17.2 above.
-
-
- 21. LIST OF FLOATING POINT INSTRUCTIONS
- =======================================
- Explanations:
- Operands: r=register, m=memory, m32=32 bit memory operand, etc.
-
- Clock cycles:
- The numbers are minimum values. Cache misses, misalignment, denormal
- operands, and exceptions may increase the clock counts considerably.
-
- Pairability:
- +=pairable with FXCH, np=not pairable with FXCH.
-
- i-ov:
- Overlap with integer instructions. i-ov = 4 means that the last four clock
- cycles can overlap with subsequent integer instructions.
-
- fp-ov:
- Overlap with floating point instructions. fp-ov = 2 means that the last two
- clock cycles can overlap with subsequent floating point instructions.
- (WAIT is considered a floating point instruction here)
-
- Opcode Operand Clock cycles Pairability i-ov fp-ov
- -----------------------------------------------------------------------------
- FLD r/m32/m64 1 + 0 0
- FLD m80 3 np 0 0
- FBLD m80 48-58 np 0 0
- FST(P) r 1 np 0 0
- FST(P) m32/m64 2 i) np 0 0
- FST(P) m80 3 i) np 0 0
- FBSTP m80 148-154 np 0 0
- FILD m 3 np 2 2
- FIST(P) m 6 np 0 0
- FLDZ FLD1 2 np 0 0
- FLDPI FLDL2E etc. 5 np 0 0
- FNSTSW AX/m16 6 n) np 0 0
- FLDCW m16 8 np 0 0
- FNSTCW m16 2 np 0 0
-
- FADD(P) r/m 3 + 2 2
- FSUB(R)(P) r/m 3 + 2 2
- FMUL(P) r/m 3 + 2 2 j)
- FDIV(R)(P) r/m 19/33/39 m) + k) 38 l) 2
- FCHS FABS 1 + 0 0
- FCOM(P)(P) FUCOM r/m 1 + 0 0
- FIADD FISUB(R) m 6 np 2 2
- FIMUL m 6 np 2 2
- FIDIV(R) m 22/36/42 m) np 38 l) 2
- FICOM m 4 np 0 0
- FTST 1 np 0 0
- FXAM 17-21 np 4 0
- FPREM 16-64 np 2 2
- FPREM1 20-70 np 2 2
- FRNDINT 9-20 np 0 0
- FSCALE 20-32 np 5 0
- FXTRACT 12-66 np 0 0
-
- FSQRT 70 np 69 l) 2
- FSIN FCOS 16-126 np 2 2
- FSINCOS 17-137 np 2 2
- F2XM1 13-57 np 2 2
- FYL2X 22-111 np 2 2
- FYL2XP1 22-103 np 2 2
- FPATAN 19-134 np 2 2
- FPTAN 17-173 np 36 l) 0
-
- FNOP 2 np 0 0
- FXCH r 1 np 0 0
- FINCSTP FDECSTP 2 np 0 0
- FFREE r 2 np 0 0
- FNCLEX 6-9 np 0 0
- FNINIT 12-22 np 0 0
- FNSAVE m 124-300 np 0 0
- FRSTOR m 70-95 np 0 0
- WAIT 1 np 0 0
- -----------------------------------------------------------------------------
- Notes:
- i) The value to store is needed one clock cycle in advance.
- j) 1 if the overlapping instruction is also an FMUL.
- k) If the FXCH is followed by an integer instruction then it will pair
- imperfectly and take an extra clock cycle so that the integer
- instruction will begin in clock cycle 3.
- l) Cannot overlap integer multiplication instructions.
- m) FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision
- respectively. FIDIV takes 3 clocks more. The precision is defined by bit
- 8-9 of the floating point control word.
- n) The first 4 clock cycles can overlap with preceding integer instructions.
- See section 17.12 above.
-
-
- 22. TESTING SPEED
- =================
- The Pentium has an internal 64 bit clock counter which can be read into
- EDX:EAX using the instruction RDTSC (read time stamp counter). This is very
- useful for testing exactly how many clock cycles a piece of code takes.
-
- The program below is useful for measuring the number of clock cycles a piece
- of code takes. The program executes the code to test 10 times and stores the
- 10 clock counts. The program can be used in both 16 and 32 bit mode.
-
- RDTSC MACRO ; define RDTSC instruction
- DB 0FH,31H
- ENDM
-
- ITER EQU 10 ; number of iterations
-
- .DATA ; data segment
- ALIGN 4
- COUNTER DD 0 ; loop counter
- TICS DD 0 ; temporary storage of clock
- RESULTLIST DD ITER DUP (0) ; list of test results
-
- .CODE ; code segment
- BEGIN: MOV [COUNTER],0 ; reset loop counter
- TESTLOOP: ; test loop
- ;**************** Do any initializations here: ************************
- FINIT
- ;**************** End of initializations ************************
- RDTSC ; read clock counter
- MOV [TICS],EAX ; save count
- CLD ; non-pairable filler
- REPT 8
- NOP ; eight NOP's to avoid shadowing effect
- ENDM
-
- ;**************** Put instructions to test here: ************************
- FLDPI ; this is only an example
- FSQRT
- RCR EBX,10
- FSTP ST
- ;********************* End of instructions to test ************************
-
- CLC ; non-pairable filler with shadow
- RDTSC ; read counter again
- SUB EAX,[TICS] ; compute difference
- SUB EAX,15 ; subtract the clock cycles used by fillers etc
- MOV EDX,[COUNTER] ; loop counter
- MOV [RESULTLIST][EDX],EAX ; store result in table
- ADD EDX,TYPE RESULTLIST ; increment counter
- MOV [COUNTER],EDX ; store counter
- CMP EDX,ITER * (TYPE RESULTLIST)
- JB TESTLOOP ; repeat ITER times
-
- ; insert here code to read out the values in RESULTLIST
-
-
- The 'filler' instructions before and after the piece of code to test are
- critical. The CLD is a non-pairable instruction which has been inserted to
- make sure the pairing is the same the first time as the subsequent times.
- The eight NOP instructions are inserted to prevent any prefixes in the code
- to test from being decoded in the shadow of the preceding instructions. Single
- byte instructions are used here to obtain the same pairing the first time as
- the subsequent times. The CLC after the code to test is a non-pairable
- instruction which has a shadow under which the 0FH prefix of the RDTSC can
- be decoded so that it is independent of any shadowing effect from the code
- to test.
-
- On the PentiumPro you have to put in a serializing instruction like CPUID
- before and after each RDTSC to prevent it from executing in parallel with
- anything else. CPUID has a decoding delay on the Pentium without MMX because
- of the 0FH prefix, but it has no shadow because it is serializing.
-
- The RDTSC instruction cannot execute in virtual mode, so if you are running
- under DOS you must skip the EMM386 (or any other memory manager) in your
- CONFIG.SYS and not run under a DOS box in Windows.
-
- The Pentium processor has special performance monitor counters which can
- count events such as cache misses, misalignments, AGI stalls, etc. Details
- about how to use the performance monitor counters are not covered by this
- manual and must be sought elsewhere (see chapter 2).
-
-
- 23. CONSIDERATIONS FOR OTHER MICROPROCESSORS
- ============================================
- Most of the optimizations described in this document have little or no
- negative effects on other microprocessors, including non-Intel processors,
- but there are some problems to be aware of.
-
- Using a full register after writing to part of the register will cause a
- moderate delay on the 80486 and a severe delay on the PentiumPro and
- PentiumII. This is called a partial register stall. Example:
- MOV AL,[EBX] / MOV ECX,EAX
- You may avoid this penalty by zeroing the full register first:
- XOR EAX,EAX / MOV AL,[EBX] / MOV ECX,EAX
- or by using MOVZX EAX,BYTE PTR [EBX]
- A similar penalty occurs when you address the same piece of memory with
- different data sizes. Example:
- MOV [ESI],EAX / MOV BL,[ESI]
-
- Scheduling floating point code for the Pentium often requires a lot of extra
- FXCH instructions. This will slow down execution on earlier microprocessors,
- but not on the PentiumPro and advanced non-Intel processors.
-
- The MMX instructions available on the PentiumMMX and PentiumII processors
- are very useful for massively parallel integer calculations.
-
- The PentiumPro chip is faster than the Pentium in some respects, but
- inferior in other respects. Knowing the strong and weak sides of the
- PentiumPro can help you make your code work well on both processors.
-
- The most important advantage of the PentiumPro is that it does much of the
- optimization for you: reordering instructions and splitting complex
- instructions into simple ones. But for perfectly optimized code there is
- less difference between the two processors.
-
- The two processors have basically the same number of execution units, so
- the throughput should be near the same. The PPro has separate units for
- memory read and write so that it can do three operations simultaneously
- if one of them is a memory read, but on the other hand it cannot do two
- memory reads or two writes simultaneously as the Pentium can.
-
- The PPro is better than the Pentium in the following respects:
-
- - out of order execution
-
- - splitting complex instructions into smaller micro-ops
-
- - one cache miss does not delay subsequent independent instructions
-
- - automatic register renaming to avoid unnecessary dependencies
-
- - better jump prediction method than Pentium without MMX
-
- - many instructions which are unpairable and poorly optimized on the
- Pentium perform better on the PPro, e.g. integer multiplication,
- movzx, cdq, bit scan, bit test, shifts by cl, and floating point store
-
- - floating point instructions and simple integer instructions can execute
- simultaneously
-
- - memory reads and writes do not occupy the ALU's
-
- - indirect memory read instructions have no AGI stall
-
- - new conditional move instructions can be used instead of branches in
- some cases
-
- - new FCOMI instruction eliminates the need for the slow FNSTSW AX
-
- The PPro is inferior to the Pentium in the following respects:
-
- - mispredicted jumps are very expensive (10-15 clock cycles!)
-
- - poor performance on 16 bit code and segmented models
-
- - prefixes are expensive (except 0FH extended opcode)
-
- - long stall when mixing 8, 16, and 32 bit registers
-
- - fadd, fsub, fmul, fchs have longer latency
-
- - cannot do two memory reads or two memory writes simultaneously
-
- - some instruction combinations cannot execute in parallel, like
- push+push, push+call, compare+conditional jump
-
- As a consequence of this, the PentiumPro may actually be slower than the
- Pentium on perfectly optimized code with many unpredictable branches, or
- on floating point code with little or no natural parallelism.
- In most other cases the PPro is faster.
-
- Most of the drawbacks of each processor can be circumvented by careful
- optimization and running in 32 bit flat mode. But the problem with mispredicted
- jumps on the PPro cannot be avoided except in the cases where you can use
- a conditional move instead.
-
- Taking advantage of the new instructions in the MMX and PentiumPro
- processors will create problems if you want your code to be compatible
- with earlier microprocessors. The solution may be to write several
- versions of your code, each optimized for a particular processor. Your
- program should automatically detect which processor it is running on and
- select the appropriate version of code. Such a complicated approach is of
- course only needed for the most critical parts of your program.
-
-